Compressed Video Indexing and Retrieval 
System using Two Dimensional 
Representation of Color 


by 

M. Anjaneya Prasad 



DEPARTMENT OF ELECTRICAL ENGINEERING 
INDIAN INSTITUTE OF TECHNOLOGY, KANPUR 
June 2004 



Compressed Video Indexing and Retrieval 
System using Two Dimensional 
Representation of Color 


A Thesis Submitted 

in Partial Fulfillment of the Requirements 
for the Degree of 
Master of Technology 


by 

M. Anjaneya Prasad 



to the 

DEPARTMENT OF ELECTRICAL ENGINEERING 
INDIAN INSTITUTE OF TECHNOLOGY, KANPUR 
June 2004 





CERTIFICATE 

It is certified that the work contained in the thesis entitled “Compressed Video Index- 
ing and Retrieval System using Two Dimensional Representation of Color” by M. Anjaneya 
Prasad has been carried out under my supervision and that this work has not been submitted 
elsewhere for a degree. 


June 2004 


Dr. Sumana Gupta
Professor, 

Department of Electrical Engineering, 
Indian Institute of Technology, 
Kanpur-208016. 




Acknowledgements 

First, I would like to express my sincere gratitude to my thesis supervisor, Dr. Sumana
Gupta, for her continuous encouragement and invaluable advice. Throughout my thesis, she
has been patient enough to listen carefully to my problems and to provide suggestions and
relevant reference materials. Without her help, it would not have been possible for me to
complete my thesis in time and also to achieve success in my personal life.

I would like to thank the professors in Electrical Engineering Department, who have 
helped me to develop the basic understanding in signal-processing and communication sub- 
jects. I also thank the office staff members for the help I have received during my stay here. 

I am thankful to Tej for his suggestions regarding implementation issues. Special thanks
go to Baju for the discussions we had throughout my thesis. I would like to thank Swagat
for his help during documentation. I am thankful to Dinesh, Ravi, Bapi, Behera and Padlikar
for their support during my stay here.

I am indebted to my parents and brother for their constant support and encouragement. 
Last but not least I owe to God. Without his blessings nothing would have been possible in 
my life. 


Anjaneya Prasad 





Abstract 

One of the challenging problems in creating a multimedia database is the organization of
the visual information. Since video requires large amounts of storage and processing,
efficient indexing and retrieval of video has become a necessity. Content-based video
indexing and retrieval systems use visual features such as color. The color-based video
indexing and retrieval methods proposed so far use a 3D representation of color. In this
thesis, we propose a new method for indexing and retrieval of video using a 2D
representation of color. This representation reduces the retrieval time significantly. A
method of mapping color from three-dimensional space to two-dimensional space is discussed.
The video indexing tools developed support automatic segmentation of video, identification
of keyframes, keyframe clustering and extraction of visual features. These visual features
are used for efficient video retrieval. The proposed video indexing method uses the DC
frames of MPEG compressed bit streams. Abrupt scene changes as well as special editing
effects such as dissolves and fades are detected accurately. A new method for keyframe
clustering is proposed which reduces the redundancies in the keyframes. The color layout
descriptor is used to extract the indices of the keyframes and to retrieve the video
segments efficiently. The experimental results show that retrieval time is significantly
reduced using the proposed 2D representation of color.

Chapter 1


Introduction 

1.1 Need for video indexing 

Multimedia information systems are becoming increasingly important with the advent of
broadband networks, high-powered workstations and compression standards. Information
databases have evolved from simple text to multimedia with video, audio and text. Since
visual media requires large amounts of storage and processing, there is a need to
efficiently store, index and retrieve the visual information from a multimedia database.
In this context, indexing of video is similar to text indexing or bookmarking, while
retrieval or searching of video clips or objects is similar to paragraph searching in the
conventional textual database framework.

Video indexing basically involves segmentation of video into identifiable partitions called
shots by determining the positions of significant scene changes. It also involves the
detection and classification of various video editing effects like dissolves and fades, and
the selection of representative frames (keyframes) to represent each shot. Indexing also
involves keyframe clustering, which reduces the redundancies in keyframes. These
redundancies arise from visual similarities between different shots of a video. The
keyframes form the basis for indexing each video partition (shot). Further operations such
as video browsing and content-based retrieval of video can then be performed using only
these keyframes instead of the entire video.

Video sequences are usually stored and transmitted in compressed format. Hence video
indexing can be done in the compressed domain itself, or the sequence can first be
decompressed and then indexed. Working in the compressed domain has the following
advantages: lower computational complexity, lower storage requirements and faster
operations.


1.2 Retrieval 


One of the main applications of video indexing is searching video by its content. The
user poses queries to a large database of videos and expects a fast, efficient and precise
reply. In a typical query, the user provides an image or another video and expects the
system to retrieve similar clips (i.e. a “query-by-example”). Perhaps the most challenging
questions in video retrieval are: “what is a good set of features to represent the video?”
and “what is a good measure of visual similarity?” The overall system of video indexing
and retrieval is shown in Fig. 1.1.



Figure 1.1: Block diagram of indexing and retrieval 

1.3 Existing color indexing approaches


Content-based video indexing and retrieval systems use features like color, texture and mo- 
tion information. Color is one of the most widely used visual features in image and video 
retrieval. Color features are relatively robust to changes in the background colors and are 
independent of image size and orientation. In this thesis we have used color for indexing and 
retrieval. 

One common method of characterizing color is to use color histograms. Ferman et al. [2]
proposed histogram-based color descriptors to capture the color properties of multiple
images or a group of frames. They proposed alpha-trimmed average histograms and
intersection histograms to represent a group of frames. Kasutani and Yamada [3] proposed
the color layout descriptor for image/video retrieval. This descriptor describes the
spatial distribution of colors in keyframes. In [4] region segmentation-based projective
histograms and their moments are used as database indices. In [5] color and texture
descriptors for image/video indexing are described. The dominant color descriptor describes
the global as well as local spatial color distribution in images for high-speed retrieval
and browsing.

1.4 Objective of the Thesis 

The major problem with most indexing techniques is the high dimensionality of the feature
space. Lower-dimensional indexing reduces the storage cost of the indices and increases the
retrieval speed. The methods proposed so far for indexing and retrieval use a
three-dimensional (3D) representation of color. If the dimensionality of color can be
reduced to two, the dimensionality of the feature space can be reduced and hence the
retrieval time can be reduced.

The main objective of this thesis is to develop a content-based video indexing and
retrieval system using a two-dimensional representation of color. To achieve this, the
first step is to represent the three-dimensional RGB color space in a two-dimensional
plane. In [1] a two-dimensional representation of color (YC) is proposed for compression
applications. We propose an indexing scheme to index videos in this space. The scene change
detection algorithm of [13] is modified so that both abrupt and gradual scene changes can
be detected using the same data extracted from the DC frames of MPEG compressed video.
A new method of keyframe clustering is proposed to reduce the redundancies in the keyframe
database. The color layout descriptor proposed in [3] is used for indexing the keyframes
and retrieving video segments.

1.5 Organization of the Thesis 

The first step in our work is to obtain a two dimensional representation of color. In chapter 
2 the important properties of RGB and YUV color planes are discussed. Based on these 
properties a spiral approximation is discussed to obtain a single color signal. The image 
is transformed from RGB to YUV color space, and the two chroma signals (U & V) are 
combined to obtain the single color(C) signal. Segmentation of MPEG compressed video 
is detailed in chapter 3. Segmentation methods of uncompressed and compressed videos 
are reviewed, and proposed method of scene change detection is described in this chapter. 
Chapter 4 discusses the method used for keyframe selection, need for keyframe clustering, 
and also describes a new method for keyframe clustering. Chapter 5 describes the method 
of indexing keyframes and also discusses the retrieval of video segments. Chapter 6 reports 
the results of various experiments. Finally chapter 7 concludes the thesis with suggestions 
for future work. 

Chapter 2


The Color Plane and Spiral Map 


The study of color is important in the design and development of color vision systems. Use
of color in image displays is not only more pleasing, but it also enables one to distinguish
between thousands of colors. Color representation is based on the classical theory of
Thomas Young (1802), who stated that any color can be reproduced by mixing a set of three
primary colors. The RGB color model takes red, green and blue as the primary colors.

The RGB color space is the basic choice for computer graphics and image processing, and
its components are the primary additive colors. However, RGB is not very efficient for
transmitting real-world images, since equal bandwidths are required to describe all three
colors. So, in the 1950s the National Television Systems Committee (NTSC) developed the
YIQ system for transmission of color images, using the existing monochrome channels
without increasing the bandwidth and maintaining compatibility with the monochrome
television system. The YIQ system takes advantage of the human visual system’s greater
sensitivity to luminance than to chrominance components. YUV is a similar color
representation used for the PAL and SECAM systems. YUV has important properties that
facilitate the spiral mapping of the U-V plane. These will be discussed in the following
sections.

2.1 Color Representation


The perceptual attributes of color are brightness, hue and saturation. Brightness represents
the perceived luminance. The hue of a color refers to its ‘redness’, ‘greenness’ and so on.
For monochromatic light sources, differences in hue are manifested by differences in
wavelength. Saturation is that aspect of the perception that varies most strongly as more
white light is added to a monochromatic light. Fig. 2.1 shows a perceptual representation
of the color space. Brightness varies along the vertical axis, hue (θ) varies along the
circumference, and saturation (S) varies along the radial distance. For a fixed brightness
W, the symbols R, G and B show the relative locations of the red, green and blue spectral
colors.



Figure 2.1: Perceptual Representation of Color Space 

Luminance is an objective measure of that aspect of visible radiant energy that produces 
the sensation of brightness. Radiation of different wavelengths contributes differently to 
the sensation of brightness. We can also conclude that Y signal gives the luminance or light 
intensity as perceived by the eyes, regardless of the color of the object seen. The contribution 
of the primary colors for obtaining Y is given by 

    Y = 0.299R + 0.587G + 0.114B        (2.1)

Hue is the predominant spectral color of the received light. The color of any object is
distinguished by its hue or tint, like red, green, blue, brown, orange and so on. The spectrum
of such signals lies along different phase angles of the color plane. In other words, hue is
another name for the phase angle in the color plane, which is discussed in the next section.

When a pure spectral color is mixed with white, the mixture is less pure than the original
color, but at the same time its brightness increases. Studying such mixtures on the color
plane reveals that the purest colors have the maximum radius, and that the radius in the
color plane decreases as white light is added, i.e. the purity of the color decreases.

In three-dimensional RGB space, the RGB values of the colors give the co-ordinates of
the corners of a rectangular block, as shown in Fig. 2.2. The six faces of this block are
the planes given by R = 0, G = 0, B = 0 and R = 1, G = 1, B = 1, and they enclose all the
points in RGB space which have valid combinations of R, G and B.





Figure 2.2: Valid Colors of RGB Color Space 

By transforming the RGB points on the surface of this block into YUV values, a block
can be constructed in 3-D YUV space which encloses the YUV values corresponding to all
valid RGB values. This block will be referred to as the ‘YUV valid-color’ block and its
shape is shown in Fig. 2.3.





Figure 2.3: Color Space Transformed From RGB to YUV 


    Y =  0.299R + 0.587G + 0.114B
    U = -0.299R - 0.587G + 0.886B        (2.2)
    V =  0.701R - 0.587G - 0.114B

Since the RGB to YUV transformation is linear, the corners of the YUV valid-color block
are, like those of the RGB block of Fig. 2.2, given by the co-ordinates of the colors, and
its surfaces are flat planes corresponding to R = 0, G = 0, B = 0 and R = 1, G = 1, B = 1.
Further details of the shape of this YUV block can be conveniently illustrated by three
mutually perpendicular views or projections obtained by viewing along the Y, U and V axes.
These views are obtained by plotting U versus V, Y versus V, and Y versus U respectively,
as shown in Fig. 2.4.
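
The transformation of the eight corners of the RGB block can be sketched in a few lines of Python. This is only an illustration: the function name `rgb_to_yuv` and the `CORNERS` table are hypothetical, and the U coefficient for B is taken as 0.886 (that is, U = B - Y), which agrees with the pure blue point U = 0.886, V = -0.114 quoted in Section 2.2.1.

```python
def rgb_to_yuv(r, g, b):
    """Apply the linear RGB-to-YUV transform to one triple with components in [0, 1]."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = -0.299 * r - 0.587 * g + 0.886 * b   # equals B - Y
    v = 0.701 * r - 0.587 * g - 0.114 * b    # equals R - Y
    return y, u, v


# The eight corners of the valid-color block in RGB space.
CORNERS = {
    "black": (0, 0, 0), "red": (1, 0, 0), "green": (0, 1, 0), "yellow": (1, 1, 0),
    "blue": (0, 0, 1), "magenta": (1, 0, 1), "cyan": (0, 1, 1), "white": (1, 1, 1),
}

if __name__ == "__main__":
    for name, rgb in CORNERS.items():
        y, u, v = rgb_to_yuv(*rgb)
        print(f"{name:8s} Y={y:+.3f} U={u:+.3f} V={v:+.3f}")
```

Black and white both map to U = V = 0, so the achromatic axis of the YUV block runs along Y, while the six saturated colors spread around it in the U-V plane.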

In Fig. 2.4, the four saturated colors having B = 0 (black, red, green and yellow) and
those having B = 1 (blue, magenta, cyan and white) lie along straight lines in the Y-U
plane. This implies that the faces of the valid-color block given by B = 0 and B = 1 are
perpendicular to the Y-U plane. Similarly, the faces given by R = 0 and R = 1 are
perpendicular to the Y-V plane.

Figure 2.4: YUV Color Space Along Different Axes

2.2 Properties of Color Plane 

The color plane possesses very important properties. It can be observed from Fig. 2.5 that
the complementary pairs (Blue-Yellow, Red-Cyan and Green-Magenta) fall exactly 180° out of
phase, and that the major color lines split the color plane into six nearly equal regions,
each lying between two adjacent lines such as Blue-Magenta, Magenta-Red, etc. [1].

Figure 2.5: Color Plane

2.2.1 Behavior Along Radius 

For any chosen angle, the change in the primary colors along that angle is linear and
systematic. Let us consider the blue color line, and choose the points at 0%, 25%, 50%, 75%
and 100% of the radius, designated P11, P21, P31, P41 and P51 respectively in Fig. 2.5.

Table 2.1: The comparison of all components along radial direction

For all the calculations, we assume that Y = 0.5. The 100% pure blue color point P51 is
given by U = 0.886, V = -0.114. For the other points P11 to P41, the corresponding
percentage of the radius along the blue line is taken and the values of U and V are
obtained.

In Table 2.1, min(R, G, B) is the minimum value among the R, G, B values, and

    R' = R - \min(R, G, B)
    G' = G - \min(R, G, B)        (2.3)
    B' = B - \min(R, G, B)

We can observe from Table 2.1 that for point P11 the U and V values equal zero, so the
point lies at the origin (the achromatic point). The min(R, G, B) value is equal to 0.5,
which is also the value of Y. In other words, min(R, G, B) is the quantity which exists
for all the colors and contributes only to the Y value along that axis of the color plane.

The remaining color quantities span the color plane and are given by R', G' and B'.
These are obtained by subtracting min(R, G, B) from the R, G, B values.

Now considering the (R', G', B') values given in Table 2.1, we note that B' changes
linearly with the radius. Similarly, in the other directions, the changes along the radii
are linear. However, the ratio among R', G', B' changes with angle, as discussed below.
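
The radial behavior described above can be checked numerically with a short sketch. It assumes the inverse transform R = Y + V, B = Y + U (from U = B - Y and V = R - Y) and solves the Y equation for G; the helper names `yuv_to_rgb` and `chroma_residue` are hypothetical. The outer points of the blue line fall outside the valid RGB block at Y = 0.5, and the sketch does not clip them.

```python
def yuv_to_rgb(y, u, v):
    """Invert Y = 0.299R + 0.587G + 0.114B with U = B - Y and V = R - Y."""
    r = y + v
    b = y + u
    g = (y - 0.299 * r - 0.114 * b) / 0.587
    return r, g, b


def chroma_residue(r, g, b):
    """Return min(R, G, B) together with the primed values (R', G', B') of Eqn. 2.3."""
    m = min(r, g, b)
    return m, (r - m, g - m, b - m)


BLUE_U, BLUE_V = 0.886, -0.114   # 100% point on the blue line

if __name__ == "__main__":
    for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
        rgb = yuv_to_rgb(0.5, frac * BLUE_U, frac * BLUE_V)
        m, primed = chroma_residue(*rgb)
        print(frac, round(m, 3), [round(p, 3) for p in primed])
```

Under these assumptions B' comes out equal to the fraction of the radius, while R' and G' stay at zero, which is the linear behavior along the radius noted in the text.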

2.2.2 Behavior Along Phase 


The points P51 and P55 are the pure blue and magenta colors respectively. The distance
between these two points is split into three equal parts, which gives the points P52 and
P54 shown in Fig. 2.5. For convenience, we work with the half-radius points in the
respective directions. These are the points P31, P32, P34 and P35.

Once again keeping the value of Y fixed at 0.5 and varying the U and V of these points,
we obtain the R', G', B' values listed in Table 2.2.

Table 2.2: Comparison of various components along phase angle

We observe that G' and B' remain constant at 0.0 and 0.5, while R' changes linearly
from 0.0 to 0.5. Hence the ratio between the color values changes with the phase angle. To
make this change linear with respect to the phase angle, the maximum radius should be made
equal in all directions. To achieve this with unit radius, the U and V axes of the color
plane are multiplied by 1/0.9 and 1/0.7 respectively. The modified color plane is shown in
Fig. 2.6, with the new axes denoted UU and VV respectively.



θ_M = 52.125°
θ_R = 108.353°
θ_Y = 170.607°
θ_G = 232.125°
θ_C = 288.353°
θ_B = 350.607°


Figure 2.6: Modified Color Plane 
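
The phase angles of the six saturated colors on the modified color plane follow directly from the axis scaling. The sketch below is an illustration, not the implementation used in the thesis: it assumes U = B - Y and V = R - Y, scales the axes by 1/0.9 and 1/0.7, and measures the angle counter-clockwise from the UU axis. The function names are hypothetical.

```python
import math


def rgb_to_uv(r, g, b):
    """Chrominance pair (U, V) = (B - Y, R - Y) for RGB components in [0, 1]."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    return b - y, r - y


def modified_plane_angle(r, g, b, u_scale=1 / 0.9, v_scale=1 / 0.7):
    """Phase angle in degrees, in [0, 360), of a color on the scaled (UU, VV) plane."""
    u, v = rgb_to_uv(r, g, b)
    angle = math.degrees(math.atan2(v * v_scale, u * u_scale))
    return angle % 360.0


SATURATED = {
    "magenta": (1, 0, 1), "red": (1, 0, 0), "yellow": (1, 1, 0),
    "green": (0, 1, 0), "cyan": (0, 1, 1), "blue": (0, 0, 1),
}

if __name__ == "__main__":
    for name, rgb in SATURATED.items():
        print(f"{name:8s} {modified_plane_angle(*rgb):8.3f} deg")
```

With these assumptions the complementary pairs land 180° apart, as stated in Section 2.2.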


2.2.3 The Color Gamuts 

The Y signal lies along the z-axis of the color planes, and the actual signal space is
three-dimensional. The color planes that we refer to are the planes corresponding to the
top view of the color space. Taking a cross-section of this color space at a particular
value of Y gives the color gamut for that value of Y. The shape of such gamuts varies in a
regular fashion as Y varies from 0.0 to 1.0 [1].


2.3 Spiral mapping of Color Plane 

In [1] a method to combine the two chrominance signals by spiral mapping was suggested.
The two chrominance signals share many similar properties, which are utilized for
approximation in the color plane. In this thesis we utilize the approximation for efficient
video indexing and retrieval. By indexing in the approximated two-dimensional YC [1] space
we can reduce the length of the feature vector and also the number of indices per keyframe,
as compared to indexing in the 3D YCbCr space discussed in [3].

2.3.1 Spiral Approximation 

It is noted from the above observations that saturation varies along the radius in the
color plane, and hue varies along the phase. As human visual perception is more sensitive
to hue (angle) than to saturation (radius), a change in radius is more tolerable than a
change in phase angle. So approximating the radius by some other value is acceptable as
long as the phase angle is preserved. These points are exploited in arriving at the spiral
mapping of the color plane. The luminance signal Y in YUV is preserved, and the color plane
spanned by U and V is spiral mapped to get a single color signal that represents both U and
V. In this section a method for approximating the color plane by a spiral is discussed.

2.3.2 The Nature of Spiral 

The spiral for five encirclements of the color plane is shown in Fig. 2.7. This spiral
spans the whole color plane only if the radius is unity in all directions of the color
plane; hence the radius of the modified color plane is taken as unity. The number of
encirclements of the spiral is denoted by L, and the radius of any point on the spiral is
denoted by C.



Figure 2.7: Spiral with L = 5 on Color Plane

The value of any spiral point is given by the radius C of that point on the spiral.
Therefore, for any given value of L, with C = 0.0 at the origin, the spiral point at the
completion of the first circle has the value C = 1.0/L; similarly, at the completion of the
second circle, C = 2.0(1.0/L), and so on. Finally, at the completion of the Lth circle,
C = L(1.0/L) = 1.0.

We can calculate the exact phase information from the spiral signal itself. This is
illustrated in the following example.

Suppose L = 5 and C = 0.42. Then

    CL = (5) × (0.42) = 2.1

The point lies beyond two complete circles, i.e. on the third circle of the spiral, and the
exact phase angle is obtained by discarding the integer part of 2.1, i.e. 2.1 - 2.0 = 0.1.
This implies that the point lies at a phase given by

    (0.1) × (360°) = 36° (i.e. 10% of 360°)

A given phase φ can correspond to any one of L values of the radius. For example, suppose
L = 5 and φ = 45°; the possible C values are

    C_1 = \frac{45/360}{L} = 0.025
    C_2 = \frac{1}{L} + \frac{45/360}{L} = 0.225
    \vdots
    C_L = \frac{L-1}{L} + \frac{45/360}{L} = 0.825
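
The two calculations above, recovering the phase from a spiral value and listing the spiral values that share one phase, can be sketched as follows. The function names are hypothetical and the sketch assumes C lies in [0, 1).

```python
def phase_from_spiral(c, turns):
    """Recover the phase in degrees from a spiral value C, given L = turns."""
    scaled = c * turns
    return (scaled - int(scaled)) * 360.0   # keep only the fractional part


def spiral_values_for_phase(phase_deg, turns):
    """List the L spiral values, one per encirclement, that share the given phase."""
    offset = (phase_deg / 360.0) / turns
    return [k / turns + offset for k in range(turns)]


if __name__ == "__main__":
    print(phase_from_spiral(0.42, 5))          # phase of the worked example
    print(spiral_values_for_phase(45.0, 5))    # one value per encirclement
```

For L = 5 and a phase of 45°, consecutive candidates differ by 1/L = 0.2, so choosing among them only changes the radius and never the phase.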

2.3.3 How to Map U-V Plane Onto Spiral ? 

The radius (r) and phase (θ) of points in the color plane (U-V) are defined as follows:

    r = \sqrt{U^2 + V^2}
    \theta = \tan^{-1}(V / U)        (2.4)

Points on the color plane are mapped onto the spiral by approximating the radius and
preserving the phase. Any point on the color plane is mapped to the nearest point on the
spiral. The mapping is as follows. The radius values of all the points in the color plane
are approximated to concentric circles centered at the origin of the color plane with
radius values n/L, n = 1, 2, ..., L, where L is the number of encirclements of the spiral.
The angle term θ/(2πL) is added to this approximated radius to get the final value of C,
so that C = approximated radius + θ/(2πL). This C is now a point on the spiral, with the
radius (r) of the color point approximated to C. In this way color is represented by only
two quantities, luminance (Y) and color (C). Mapping back from the spiral color (C) to the
color plane (U-V) is as follows:

    U = C \cos(2\pi (LC - \lfloor LC \rfloor))
    V = C \sin(2\pi (LC - \lfloor LC \rfloor))        (2.5)

The quantity ⌊LC⌋ is the integer part of the product of L and C.

2.3.3.1 Example of Mapping

Let the radius r = 0.45, θ = 270° and L = 5. Then r is approximated to 0.4 and
C = 0.4 + θ/(2πL) = 0.4 + 0.15 = 0.55. To remap to U-V, θ = 2π(LC - ⌊LC⌋) =
2π(2.75 - 2) = 270°, and U and V can be obtained using Eqn. 2.5. From this example it is
clear that spiral mapping of color points is achieved by preserving the phase and
approximating the radius.
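
The forward and inverse mappings can be sketched as below. The text only says that the radius is approximated to a circle of radius n/L; rounding to the nearest multiple of 1/L and clamping to the last encirclement so that C stays below 1 are assumptions made here for illustration, and the function names are hypothetical.

```python
import math


def spiral_encode(u, v, turns):
    """Map a point (U, V) on the unit-radius color plane to a spiral value C."""
    r = math.hypot(u, v)
    theta = math.atan2(v, u) % (2.0 * math.pi)      # phase in [0, 2*pi)
    # Assumption: quantize the radius to the nearest multiple of 1/L and
    # clamp it so that the result stays inside the last encirclement.
    ring = min(round(r * turns), turns - 1)
    return ring / turns + theta / (2.0 * math.pi * turns)


def spiral_decode(c, turns):
    """Map a spiral value C back to (U, V): radius C, phase from the fraction of L*C."""
    scaled = turns * c
    angle = 2.0 * math.pi * (scaled - math.floor(scaled))
    return c * math.cos(angle), c * math.sin(angle)


if __name__ == "__main__":
    # Worked example: r = 0.45 at a phase of 270 degrees, L = 5.
    c = spiral_encode(0.0, -0.45, 5)
    print(c, spiral_decode(c, 5))
```

The decoded point keeps the original phase of 270° but has radius C = 0.55 instead of 0.45, which is the radius approximation that the mapping trades for a single chrominance value.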

Chapter 3


Segmentation of MPEG Compressed 
Video 


Temporal video segmentation is generally accepted as the first step in content based analysis 
of video sequences. It breaks the video into a set of meaningful and manageable segments 
called shots that are used as basic units for indexing and annotation. Each shot is represented 
by one or more keyframes. The content of the shot is indexed by spatial features like color 
and texture extracted from the keyframes. In addition temporal features from the shot like 
motion, camera operations, can be used for indexing. 

A shot is defined as one or more frames generated and recorded contiguously, representing
a continuous action in time and space. Video editing produces two general types of shot
transitions: abrupt and gradual. Abrupt transitions are the most common; they occur over a
single frame, where two distinct scenes are spliced together. Gradual transitions occur
over multiple frames and are the result of effects such as dissolves and fades. A dissolve
shows one image superimposed on the other as the frames of the first shot get dimmer and
those of the second get brighter. A fade is a gradual change in brightness, usually
resulting in or starting with a solid color.

As the characteristics of the frames before and after an abrupt transition usually differ
significantly, abrupt transitions are much easier to detect than gradual transitions. The
recognition of gradual transitions is further complicated by camera operations like pans
and zooms, and by object movement, as these exhibit temporal variation of the same order
and cause false positives.

Algorithms for scene change detection exist both for uncompressed and compressed 
videos. In the following section a review of methods for temporal video segmentation of 
uncompressed video is given. 

3.1 Temporal video segmentation of uncompressed video 

A majority of algorithms proposed for temporal video segmentation process uncompressed 
video. Usually, a similarity measure between successive frames is defined. When the two 
frames are sufficiently dissimilar, there may be a cut. Gradual transitions are found by using 
cumulative difference measures and more sophisticated threshold schemes. 

Zhang [6] proposed that a change between two frames can be detected by comparing the
intensity values of the corresponding pixels in the two frames. The algorithm counts the
number of pixels that have changed, and an abrupt change is declared if this number,
expressed as a percentage of the total number of pixels, exceeds a certain threshold [6].
However, this technique may produce false detections, since camera movement and object
motion can also change a large number of pixels.
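
The pixel-comparison test can be sketched as follows. This is a minimal illustration, assuming frames are given as equal-sized lists of rows of gray values; the function names and the two default thresholds are hypothetical and are not taken from [6].

```python
def changed_pixel_fraction(frame_a, frame_b, pixel_threshold):
    """Fraction of pixel positions whose intensity differs by more than pixel_threshold."""
    total = 0
    changed = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for pa, pb in zip(row_a, row_b):
            total += 1
            if abs(pa - pb) > pixel_threshold:
                changed += 1
    return changed / total


def is_abrupt_change(frame_a, frame_b, pixel_threshold=20, fraction_threshold=0.5):
    """Declare a cut when the changed fraction exceeds fraction_threshold."""
    return changed_pixel_fraction(frame_a, frame_b, pixel_threshold) > fraction_threshold


if __name__ == "__main__":
    dark = [[10] * 4 for _ in range(4)]
    bright = [[200] * 4 for _ in range(4)]
    print(is_abrupt_change(dark, bright))
```

Because every pixel position is compared independently, a camera pan that shifts the whole image can change most positions and trigger the same decision as a real cut.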

In the likelihood ratio approach [6][7] the frames are subdivided into blocks, which are
then compared on the basis of the statistical characteristics of their intensity levels.
Eqn. 3.1 gives the formula for the likelihood ratio λ:

    \lambda = \frac{\left[ \frac{\sigma_i + \sigma_{i+1}}{2} + \left( \frac{\mu_i - \mu_{i+1}}{2} \right)^2 \right]^2}{\sigma_i \, \sigma_{i+1}}        (3.1)

where μ_i and μ_{i+1} are the mean intensity values for a given region in two consecutive
frames and σ_i and σ_{i+1} are the corresponding variances. The number of blocks for which
λ exceeds a certain threshold is counted, and if this number exceeds a certain value, a
scene change is declared. A subset of the blocks can be used to detect the difference
between the images so as to speed up the block comparison. This approach is better than
the pixel-based approach, as it increases the tolerance to noise associated with camera and
object movement.

It is possible, however, for two corresponding blocks to be different and yet have the
same density function; in such cases no change is detected.
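
The block comparison can be sketched as below. The sketch assumes the commonly cited form of the likelihood ratio, [(σ_i + σ_{i+1})/2 + ((μ_i - μ_{i+1})/2)²]² / (σ_i σ_{i+1}), with σ denoting the variance; the function names and the small `eps` guard against a zero denominator are assumptions for illustration.

```python
def mean_and_variance(block):
    """Mean and (population) variance of the intensities in one block."""
    values = [p for row in block for p in row]
    mean = sum(values) / len(values)
    variance = sum((p - mean) ** 2 for p in values) / len(values)
    return mean, variance


def likelihood_ratio(block_a, block_b, eps=1e-9):
    """Likelihood ratio between two corresponding blocks of consecutive frames."""
    mu_a, var_a = mean_and_variance(block_a)
    mu_b, var_b = mean_and_variance(block_b)
    numerator = ((var_a + var_b) / 2.0 + ((mu_a - mu_b) / 2.0) ** 2) ** 2
    return numerator / max(var_a * var_b, eps)   # eps avoids division by zero


def count_changed_blocks(blocks_a, blocks_b, ratio_threshold):
    """Number of corresponding block pairs whose ratio exceeds ratio_threshold."""
    return sum(
        1 for a, b in zip(blocks_a, blocks_b)
        if likelihood_ratio(a, b) > ratio_threshold
    )


if __name__ == "__main__":
    a = [[10, 20], [30, 40]]
    shifted = [[110, 120], [130, 140]]   # same variance, different mean
    flipped = [[40, 30], [20, 10]]       # different layout, same mean and variance
    print(likelihood_ratio(a, shifted), likelihood_ratio(a, flipped))
```

The `flipped` block illustrates the limitation noted above: its pixels are arranged differently, yet its mean and variance match those of `a`, so the ratio stays at 1 and no change is reported.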

Sensitivity to camera and object movement can be further reduced by comparing the
gray-level histograms of the two frames [6][8]. This is because two frames whose
backgrounds differ by a small amount and which have the same amount of object motion have
almost the same histogram. The histogram gives the number of pixels belonging to each gray
level in the frame. The histogram metric is given by Eqn. 3.2:

    SD_i = \sum_{j=1}^{G} \left| H_i(j) - H_{i+1}(j) \right|        (3.2)

where G is the number of gray levels, j is the gray value, i is the frame number, and
H_i(j) is the value of the histogram of frame i for gray level j. If the sum of absolute
differences of corresponding values of consecutive histograms is greater than a given
threshold T_h, a transition is declared.
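
The histogram metric can be sketched in plain Python as follows, assuming frames are lists of rows of integer gray values in the range 0 to levels - 1. The function names are hypothetical.

```python
def gray_histogram(frame, levels=256):
    """Count the pixels at each gray level of one frame."""
    hist = [0] * levels
    for row in frame:
        for p in row:
            hist[p] += 1
    return hist


def histogram_difference(frame_a, frame_b, levels=256):
    """Sum of absolute bin-wise differences between two gray-level histograms."""
    hist_a = gray_histogram(frame_a, levels)
    hist_b = gray_histogram(frame_b, levels)
    return sum(abs(x - y) for x, y in zip(hist_a, hist_b))


if __name__ == "__main__":
    scene = [[0, 0], [255, 255]]
    moved = [[255, 255], [0, 0]]     # same content, object moved
    other = [[128, 128], [128, 128]]
    print(histogram_difference(scene, moved), histogram_difference(scene, other))
```

The `moved` frame has the same gray-level counts as `scene`, so its difference is zero even though every pixel position changed; this is the insensitivity to motion that the histogram comparison relies on.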

Zabih et al. [9] have proposed a feature-based approach for detecting sudden scene
changes. During a cut, new intensity edges appear far from the locations of the old edges.
Similarly, old edges disappear far from the locations of new edges. By counting the number
of entering and exiting edge pixels, an abrupt scene change can be identified. However,
this algorithm requires edge detection in every frame, which is computationally very
costly. Another limitation of this scheme is that the edge detection method does not handle
rapid changes in overall scene brightness or scenes with high contrast levels.

The twin-comparison method [6] is a histogram-based technique for recognizing abrupt as
well as gradual transitions. In the first pass a high threshold T_h is used to detect
abrupt transitions. In the second pass a lower threshold T_l is employed to detect the
potential starting frame F_s of a gradual transition. F_s is then compared to subsequent
frames. This is called an accumulated comparison, since during a gradual transition this
difference value increases. The end frame of the transition is detected when the difference
between consecutive frames decreases to less than T_l while the accumulated comparison has
increased to a value higher than T_h. If the consecutive difference falls below T_l before
the accumulated comparison exceeds T_h, the potential start frame F_s is dropped and the
search continues for other gradual transitions. However, for most gradual transitions the
frame difference falls below the lower threshold. Such transitions cannot be detected
using the twin-comparison technique.
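
The twin-comparison logic can be sketched as below. This is a simplified illustration, not the implementation of [6]: it folds the two passes into a single loop, takes precomputed histograms as input, and ignores a candidate transition that is still open when the sequence ends. The function names are hypothetical.

```python
def hist_diff(h1, h2):
    """Sum of absolute bin-wise differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def twin_comparison(hists, t_high, t_low):
    """Return (cuts, graduals) for a list of per-frame histograms.

    cuts holds the index of the first frame after each abrupt change;
    graduals holds (start_frame, end_frame) pairs.
    """
    cuts, graduals = [], []
    start = None                      # potential start frame F_s
    for i in range(len(hists) - 1):
        d = hist_diff(hists[i], hists[i + 1])
        if d >= t_high:
            cuts.append(i + 1)        # abrupt transition
            start = None
        elif d >= t_low:
            if start is None:
                start = i             # mark a potential gradual transition
        elif start is not None:
            # Consecutive difference fell below t_low: check the accumulated
            # difference between F_s and the current frame.
            if hist_diff(hists[start], hists[i]) >= t_high:
                graduals.append((start, i))
            start = None              # accept or drop the candidate
    return cuts, graduals


if __name__ == "__main__":
    a, b = [100, 0], [0, 100]
    ramp = [[80, 20], [60, 40], [40, 60], [20, 80]]
    frames = [a, a, a] + ramp + [b, b, b, a, a]
    print(twin_comparison(frames, t_high=150, t_low=30))
```

In the demo sequence each step of the ramp differs from its neighbor by only 40, well below the high threshold, but the accumulated difference between frame 2 and frame 7 reaches 200, so the ramp is reported as one gradual transition and the final jump as a cut.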

Several statistical-feature based techniques have also been proposed for gradual transition
detection. Alattar used the quadratic behaviour of the variance to detect fades. This
algorithm can only detect fade-ins and fade-outs where the end frames are fixed. When the
sequence has considerable motion, the algorithm fails to identify fade-in and fade-out
regions.

Since MPEG was established as an international standard for compression of digital
video, videos are increasingly stored and transmitted in compressed format. Hence, it is
highly desirable to develop methods that can operate directly on the encoded stream.
Working in the compressed domain offers the following advantages. Firstly, by not having
to perform decoding and re-encoding, computational complexity is reduced and both
decompression time and storage are saved. Secondly, operations are faster due to the lower
data rate of compressed video. Last but not least, the encoded stream already contains a
rich set of pre-computed features, such as motion vectors and block averages, that are
suitable for temporal video segmentation. The following section gives an overview of the
MPEG compression standard.

3.2 Overview of MPEG compression standard 

The Moving Picture Experts Group (MPEG) standard is the most widely accepted
international standard for digital video compression. MPEG video is broken up into a
hierarchy of layers to help with error handling, random search, editing and
synchronization. The first (top) layer is known as the video sequence layer. The second
layer is the group of pictures (GOP) layer. The third layer is the picture layer itself,
and the layer below that is called the slice layer. Each slice consists of macroblocks
(MBs), which are 16 × 16 arrays of luminance pixels, or picture data elements, with 8 × 8
arrays of associated chrominance pixels. Macroblocks are the units of motion-compensated
compression. These macroblocks are further divided into 8 × 8 blocks of pixels.

MPEG compression uses two basic techniques: macroblock based motion compensation
to reduce temporal redundancy and transform domain block-based compression to
exploit spatial redundancy. An MPEG stream consists of three types of pictures: intra
coded (I-frames), predicted (P-frames) and bi-directional (B-frames). These pictures are
combined in a repetitive pattern called a group of pictures (GOP). A GOP starts with an I
frame. The I and P frames are referred to as anchor frames. B frames appear between each pair
of consecutive anchor frames. Fig.3.1 shows a typical MPEG video sequence including a
GOP of 12 frames; the sub-GOP size is 3.


[Figure 3.1 labels: forward prediction, backward prediction, group of pictures]

Figure 3.1: Typical MPEG compressed video sequence

Each video frame is divided into a sequence of nonoverlapping MBs. Each MB can
be either intra coded or inter coded (i.e. coded with motion compensation). I frames are
typically intra coded: every 8 x 8 pixel block in the MB is transformed to the frequency
domain using the Discrete Cosine Transform (DCT). The first DCT coefficient is called
the DC term and is 8 times the average intensity of the respective block. The 64 coefficients
are then (lossy) quantized and (losslessly) entropy encoded using Run Length Encoding
(RLE) and Huffman coding. As I frames are coded without reference to any
other video frames, they can be decoded independently and hence provide random access
points into the compressed video.

P frames are predictively coded with reference to the nearest past anchor frame (i.e. the
previous I or P frame). For each MB in a P frame, the encoder searches the anchor frame
and finds the best matching block in terms of intensity. The MB is then represented by a
motion vector (MV), which points to the position of the match, and the difference (residue)
between the MB and its match. The residue is then DCT encoded, quantized and entropy
coded, while the MV is differentially and entropy coded with respect to its neighbouring
MVs. This is called encoding with forward motion compensation. An inter coded MB
provides higher compression gain than an intra coded one, as the residue can be represented with
fewer bits.

To achieve further compression, B frames are bi-directionally predictively encoded with 
forward and/or backward motion compensation with respect to the nearest past and/or future 
anchor frames. As B frames are not used as reference for coding other frames, they can 
accommodate more distortion, and thus, provide higher compression compared to I and P 
frames. During the encoding process a test is made on each MB of a P or B frame to see
whether it is more expensive to use motion compensation or intra coding. Intra coding is chosen when
the current frame does not have much in common with the anchor frame(s). As a result each
MB of a P frame is coded either intra or forward, while for each MB of a B frame
there are four possibilities: intra, forward, backward or interpolated.

3.3 Temporal video segmentation of compressed video


Several algorithms for temporal video segmentation in the compressed domain have been
reported. These algorithms use a set of pre-computed features that the encoded stream already
contains, such as DCT coefficients, DC terms, motion vectors (MVs), and MB coding modes.

One approach that uses the DCT coefficients to find frames where camera breaks have
occurred is as follows [10]. From the 8 x 8 blocks of a single video frame, m, that have
been encoded using the DCT, a subset of blocks is chosen a priori. The blocks are chosen
from n connected regions in each frame. Next, a subset of the 64 coefficients of each block is
chosen. The coefficients chosen are randomly distributed among the AC coefficients of the
blocks. Taking the chosen coefficients from each frame, a vector is formed as follows:


\[ V_m = [c_1, c_2, c_3, c_4, \ldots] \tag{3.3} \]

This vector represents the frame of the video in DCT space. The inner product is used to
find the difference between the two frames:

\[ \Psi = \frac{V_m \cdot V_{m+1}}{|V_m| \, |V_{m+1}|} \tag{3.4} \]

where V_m is the vector of the frame being compared and V_{m+1} is the vector of the successor
frame. A transition is detected when 1 - |Ψ| > t, where t is some threshold.
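
The comparison in Eqns. 3.3 and 3.4 can be sketched in a few lines of Python. This is only an illustration: the function names, the handling of all-zero vectors and the default threshold `t=0.1` are assumptions made for the example and are not taken from [10]. The coefficient vectors are assumed to have been extracted already.

```python
import math

def dct_vector_difference(v_m, v_next):
    """Return 1 - |Psi|, where Psi is the normalized inner product (Eqn. 3.4)."""
    dot = sum(a * b for a, b in zip(v_m, v_next))
    norm = math.sqrt(sum(a * a for a in v_m)) * math.sqrt(sum(b * b for b in v_next))
    if norm == 0.0:
        # Assumption: two all-zero vectors are identical, otherwise maximally different.
        return 0.0 if list(v_m) == list(v_next) else 1.0
    return 1.0 - abs(dot / norm)

def is_transition(v_m, v_next, t=0.1):
    """Declare a transition when the difference exceeds the threshold t (hypothetical default)."""
    return dct_vector_difference(v_m, v_next) > t
```

Vectors that differ only by a scale factor give a difference close to 0, while orthogonal vectors give 1.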

Zhang et al. [11] have also experimented with motion-based segmentation using the motion
vectors in the MPEG compressed data as well as the DCT coefficients. Meng et al.
have extended this concept further by performing more detailed operations on the MPEG
compressed data. If there is a break at a B-frame, most of the motion vectors will come from
the following anchor frame and few will come from the previous anchor frame. A scene cut
is detected based on the ratio of the backward and forward motion vectors. When a scene
change occurs at a P-frame, the encoder cannot use macroblocks from the previous anchor
frame for motion compensation, as P-frames have only forward motion compensation. A
scene break is detected based on the ratio of macroblocks without motion compensation to
macroblocks with motion compensation. Since I-frames are completely intra-coded, without
motion vectors, the method using DCT coefficients described above [10] can be used for
scene change detection.

The algorithm we propose for scene change detection operates on the DC sequence extracted
from the MPEG compressed video. The following section describes the extraction
of DC sequences from MPEG compressed video.

3.4 Extraction of DC sequences from MPEG compressed 
video 

A reduced image obtained from the collection of scaled DC coefficients in DCT compressed
video is called a DC image. DC images are spatially reduced versions of the original images.
Fig.3.2 shows an original image of size 352 x 288 and Fig.3.3 shows its DC image of size
44 x 36.

While the DC image is much smaller than the original image as it occupies only a small 
fraction of the original data size, it still retains most of the essential ‘global’ information. 
This suggests that scene change operations of a global nature that are performed on the 
original image can also be performed on the DC image. Operating on these DC images 
offers a significant saving in computations. In this section, it is shown how DC images can
be effectively extracted from MPEG compressed videos, as suggested by Yeo and Liu [12].

Extraction of the DC image from an I-frame is trivial. The DC term c(0, 0) is related to the
pixel values f(x, y) by

\[ c(0,0) = \frac{1}{8} \sum_{x=0}^{7} \sum_{y=0}^{7} f(x,y) \tag{3.5} \]

which is 8 times the average intensity of the block. Thus the DC image is formed by blockwise
averaging of the original image.
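
The relation in Eqn. 3.5 can be illustrated in the pixel domain with the following sketch. In the compressed domain the DC terms are read from the bit stream rather than computed from pixels, so this only demonstrates the arithmetic; the function names and the list-of-rows frame layout are assumptions made for the example.

```python
def dc_terms(frame, block=8):
    """Return c(0,0) for every block x block tile of a luminance frame.

    frame is a list of equally long rows of pixel intensities. Rows and
    columns that do not fill a whole block are ignored (an assumption).
    """
    rows, cols = len(frame) // block, len(frame[0]) // block
    return [[sum(frame[r * block + y][c * block + x]
                 for y in range(block) for x in range(block)) / block
             for c in range(cols)]
            for r in range(rows)]

def dc_image(frame, block=8):
    """Scale each DC term by 1/block to obtain the block average (the DC image)."""
    return [[c / block for c in row] for row in dc_terms(frame, block)]
```

For a 352 x 288 frame this produces a 44 x 36 DC image, as in Fig.3.3.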

Figure 3.2: Full image at 352 x 288

Figure 3.3: DC image at 44 x 36

For P and B-frames, motion information must be employed to derive the DC image.
To obtain the DC coefficients of the P-frame using the DCT coefficients from the previous
I-frame, refer to Fig.3.4. P_ref is the current block of interest. P_1, ..., P_4 are the four original
neighbouring blocks from which P_ref is derived. Let h_i and w_i be the height and width of
P_ref ∩ P_i. The shaded regions in P_1, ..., P_4 are moved by (w_1, h_1). Due to the linearity of
the DCT, the DC coefficient of P_ref is of the form

\[ DC(P_{ref}) = \sum_{i=1}^{4} \left( \sum_{m=0}^{7} \sum_{l=0}^{7} w^{i}_{ml} \, (DCT(P_i))_{ml} \right) \tag{3.6} \]

where we denote the (m, l) component of P_i by (P_i)_{ml}. The factor w^i_{ml} weights the contribution
of (DCT(P_i))_{ml}. The evaluation of eqn.3.6 may take a maximum of 256 multiplications. To
reduce the number of multiplications Yeo [12] proposed a first order approximation of eqn.3.6.

Figure 3.4: Reference Block (P_ref), Motion Vectors and Original Blocks


Under this approximation, the DC coefficient of P_ref, denoted as DC(P_ref)', is obtained
by weighting the contributions from the 4 neighbouring DC values with the ratio of
overlaps of the block P_ref with each of the blocks P_1, ..., P_4 respectively. That is,

\[ DC(P_{ref})' = \sum_{i=1}^{4} \frac{h_i w_i}{64} \, DC(P_i) \tag{3.7} \]

Here at most 4 multiplications are required to obtain each DC value. Only the DC coefficients
of the 4 neighbouring blocks {P_i} and the motion vector information are used to obtain the DC
values. Such information is easily obtained from MPEG compressed bit streams.

Under the first order approximation, the DC value of P_ref given in eqn.3.6 can be written as
the sum of the first order approximation DC(P_ref)' and an error term that does not depend on the
DC coefficients of the reference blocks, i.e.,

\[ DC(P_{ref}) = DC(P_{ref})' + c \tag{3.8} \]

where c is the error due to the first order approximation and is independent of the DC coefficients
of P_i.
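
A minimal sketch of the overlap-weighted sum in Eqn. 3.7 is given below. The numbering of the four blocks (P_1 top-left, P_2 top-right, P_3 bottom-left, P_4 bottom-right) and the convention that (dx, dy) is the offset of P_ref from the top-left corner of P_1 are assumptions made for the example; the function names are hypothetical.

```python
def overlaps(dx, dy, block=8):
    """Heights and widths (h_i, w_i) of the overlap of P_ref with P_1..P_4.

    (dx, dy) is the offset of P_ref from the top-left corner of P_1,
    with 0 <= dx, dy < block (assumed convention).
    """
    return [(block - dy, block - dx), (block - dy, dx),
            (dy, block - dx), (dy, dx)]

def approx_dc(dc_values, dx, dy, block=8):
    """First order approximation: overlap-weighted sum of the four DC values."""
    area = block * block
    return sum(h * w / area * dc
               for (h, w), dc in zip(overlaps(dx, dy, block), dc_values))
```

With a zero offset the result is simply the DC value of P_1, and with an offset of half a block in both directions it is the plain average of the four DC values.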

This approach can be extended to extract DC images of P-frames. It is observed from 
the expressions that extracting DC coefficients using the suggested approximation for P-frames
yields DC images that are very close to the actual ones.

3.5 Proposed Scene Change Detection Algorithm

Fernando et al. [13] proposed an algorithm to detect gradual scene changes using statistical
features (mean and variance) of the luminance component of each frame. To detect abrupt scene
changes they used a different algorithm, which uses macroblock characteristics of the
B-frames. In our proposed work we have modified this algorithm so that both abrupt
and gradual changes can be detected using the same data (mean and variance). We have also
studied the effect of color on scene change detection.

3.5.1 Detection of abrupt scene change 

We use a sliding window to examine m successive frame differences. A frame difference
is the difference between the mean values of two successive DC images. Let X_i,
i = 1, 2, 3, ..., N, be a sequence of DC images. We form the difference sequence D_i,
i = 1, 2, 3, ..., N - 1, as

\[ D_i = |d(X_i, X_{i+1})| \tag{3.9} \]

where d() is the difference of the mean values of X_i and X_{i+1}. We declare a scene change
from X_i to X_{i+1} if

1. the difference is the maximum within a symmetric sliding window of size 2m - 1, i.e.,
D_i >= D_j, j = i - m + 1, ..., i - 1, i + 1, ..., i + m - 1, and

2. D_i is also at least n times the second largest value in the sliding window, where n
should be at least 2.5. Performance of cut detection for different values of n is discussed
in sec.6.2.1.

The parameter m is set to be smaller than the minimum duration between two scene changes. 
It is found experimentally in [14] that m = 10 gives good results. 

Criterion 2 is imposed to guard against fast panning and zooming of scenes and also
to prevent scenes with camera flashes from being declared as scene changes. Fast panning and
zooming usually manifest themselves as sequences of large values in the difference sequence
D_i. Flashes typically produce two consecutive sharp peaks. If D_i declares a scene change,
then it is not necessary to check D_{i+1}, ..., D_{i+m-1} and we can immediately proceed to check
D_{i+m}.
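
The two criteria can be sketched as follows. This is an illustrative sketch rather than the implementation used in this work: indices are 0-based (a returned index i means a cut between frames i and i + 1), the window is truncated at the ends of the sequence, and an all-zero window is never declared a cut. These three choices, and the function names, are assumptions.

```python
def frame_differences(means):
    """D_i: absolute difference of the mean values of successive DC images."""
    return [abs(b - a) for a, b in zip(means, means[1:])]

def detect_cuts(diffs, m=10, n=2.5):
    """Return indices i where D_i satisfies both sliding-window criteria."""
    cuts = []
    i = 0
    while i < len(diffs):
        lo = max(0, i - m + 1)
        hi = min(len(diffs), i + m)
        others = diffs[lo:i] + diffs[i + 1:hi]
        second = max(others) if others else 0.0
        # Criterion 1: window maximum. Criterion 2: n times the second largest value.
        if diffs[i] > 0 and diffs[i] >= second and diffs[i] >= n * second:
            cuts.append(i)
            i += m  # skip the next m - 1 differences, as noted above
        else:
            i += 1
    return cuts
```

A single isolated peak is reported as a cut, whereas the two equal peaks produced by a flash, or a run of similar values produced by panning, fail criterion 2.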

The detection performance can be improved if we combine the results from both luminance
and color. One way of doing this is to declare a scene change from i to i + 1 if there is
a scene change in either the luminance or the color component. Although this improves detection
performance, it also increases the number of false detections. To avoid this problem we form
the difference sequence D_i, i = 1, 2, 3, ..., N - 1, as the sum of the differences of the mean
values of luminance (Y) and color (C) of successive frames, and the modified
equation for D_i becomes

\[ D_i = |d_Y(X_i, X_{i+1})| + |d_C(X_i, X_{i+1})| \tag{3.10} \]

where d_Y and d_C are the differences of the mean values of the luminance and color components
of X_i and X_{i+1} respectively.
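
As a small illustration of Eqn. 3.10, the combined difference sequence can be computed from the per-frame mean values of the two components. The function name and the use of two parallel lists of means are assumptions made for the example.

```python
def combined_differences(y_means, c_means):
    """D_i = |d_Y| + |d_C| for successive DC images (Eqn. 3.10)."""
    return [abs(y2 - y1) + abs(c2 - c1)
            for y1, y2, c1, c2 in zip(y_means, y_means[1:], c_means, c_means[1:])]
```

The resulting sequence can be passed to the same sliding-window test that is used for the luminance-only differences.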

This approach is found to perform better than both using only the luminance component
and combining the separate results from the two components. The results of the
cut detection algorithm are reported in section 6.2.1.

3.5.2 Detection of Gradual scene changes 

In video editing and production, two or more picture signals are simply added together so 
that the two pictures appear to merge on the output screen. Very often this process is used to 
move on from picture A to picture B. In this case, the proportions of the two signals are such 
that as the contribution of picture A changes from 100% to zero, the contribution of picture 
B changes from zero to 100%. This is called dissolving. When picture A is a solid color, 
the process is called fade-in, and when picture B is a solid color it is known as fade-out. 





Mathematically dissolving can be expressed as follows:

\[ S_n(x,y) = \begin{cases}
f_n(x,y), & 0 \le n < L_1 \\
\left[1 - \frac{n - L_1}{F}\right] f_n(x,y) + \left(\frac{n - L_1}{F}\right) g_n(x,y), & L_1 \le n \le (L_1 + F) \\
g_n(x,y), & (L_1 + F) < n \le L_2
\end{cases} \tag{3.11} \]

where S_n(x, y) is the resultant video signal, f_n(x, y) is picture A, g_n(x, y) is picture B, L_1
is the length of the sequence of picture A alone, F is the length of the dissolving sequence, and
L_2 is the length of the total sequence.

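It can be shown from Eqn.3.11 that during dissolving, the mean (m) and variance (σ²)
have a linear and quadratic behaviour, respectively, as shown in Eqn.3.12 and Eqn.3.13.
Here m_f, m_g and σ_f², σ_g² denote the means and variances of pictures A and B, and the
two pictures are taken to be uncorrelated.

\[ m_n = E[S_n(x,y)] = \begin{cases}
m_f, & 0 \le n < L_1 \\
\frac{m_g - m_f}{F} (n - L_1) + m_f, & L_1 \le n \le (L_1 + F) \\
m_g, & (L_1 + F) < n \le L_2
\end{cases} \tag{3.12} \]

\[ \sigma_n^2 = \begin{cases}
\sigma_f^2, & 0 \le n < L_1 \\
\phi(n), & L_1 \le n \le (L_1 + F) \\
\sigma_g^2, & (L_1 + F) < n \le L_2
\end{cases} \tag{3.13} \]

where

\[ \phi(n) = \xi n^2 - \left( \frac{2\sigma_f^2}{F} + 2 L_1 \xi \right) n + \left( \sigma_f^2 + L_1^2 \xi + \frac{2 L_1 \sigma_f^2}{F} \right), \qquad \xi = \frac{\sigma_f^2 + \sigma_g^2}{F^2} \tag{3.14} \]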
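
The linear and quadratic behaviour can be checked numerically on a toy example. The sketch below blends two tiny "pictures" according to Eqn.3.11 and inspects the first difference of the mean and the second difference of the variance. The pixel values are made up for the illustration and were chosen so that the two pictures are uncorrelated; all names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def dissolve_frame(f, g, n, L1, F):
    """Blend pictures f and g at frame n of a dissolve starting at L1 (Eqn. 3.11)."""
    alpha = (n - L1) / F
    return [(1 - alpha) * a + alpha * b for a, b in zip(f, g)]

f = [0, 0, 2, 2]   # picture A: mean 1, variance 1
g = [0, 4, 0, 4]   # picture B: mean 2, variance 4, uncorrelated with f
L1, F = 0, 4
frames = [dissolve_frame(f, g, n, L1, F) for n in range(F + 1)]
means = [mean(fr) for fr in frames]
variances = [variance(fr) for fr in frames]

# A linear mean has a constant first difference; a quadratic variance has a
# constant second difference equal to 2 * (var_f + var_g) / F**2.
first_diff_mean = [b - a for a, b in zip(means, means[1:])]
second_diff_var = [c - 2 * b + a
                   for a, b, c in zip(variances, variances[1:], variances[2:])]
```

For these values the first difference of the mean is (m_g - m_f)/F = 0.25 at every step and the second difference of the variance is 2(1 + 4)/16 = 0.625.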


Gradual transitions can be detected using the linear behaviour of the mean and the quadratic
(parabolic) behaviour of the variance. Fernando et al. [13] proposed a method that combines the
mean and variance to find gradual transitions. Their criterion for detecting gradual transitions
is that the ratio of the first derivative of the mean to the second
derivative of the variance should be constant during a dissolve period. In practice, this ratio
is not constant and shows variations. In our proposed work, we suggest the following approach
to detect gradual scene changes.

3.5.2.1 Detection of dissolves

As the variance has quadratic behaviour in the dissolve region, the differential values of variances
of successive DC frames are negative for the first half of the dissolve region and positive
for the next half of the dissolve region. The set of frames with these characteristics
is {S} and the number of frames in this set is F_1. If F_1 > F, where F is the number of
frames in the dissolve region as given in Eqn.3.11, then there is a dissolve w.r.t. variance,
and its presence is confirmed if the mean values of {S} have linear behaviour. That is, if the
differential values of means of successive DC frames of {S} are all either positive or negative,
depending on whether the mean has positive or negative slope, then the presence of a dissolve
is confirmed and the starting frame of {S} is taken as the starting frame of the dissolve.
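
The procedure above can be sketched as a scan over the per-frame mean and variance of the DC images. This is an illustrative sketch under stated assumptions: `min_frames` plays the role of F in the test F_1 > F, strict sign tests are used without any noise tolerance, and the function name and return format are hypothetical. Real sequences would likely need a small tolerance on the sign tests.

```python
def find_dissolves(means, variances, min_frames):
    """Return (start, end) frame index pairs of candidate dissolves.

    A candidate is a run of negative variance differences followed by a run
    of positive ones, confirmed when the mean differences over the same
    frames all have the same sign.
    """
    dv = [b - a for a, b in zip(variances, variances[1:])]
    dm = [b - a for a, b in zip(means, means[1:])]
    found = []
    i = 0
    while i < len(dv):
        j = i
        while j < len(dv) and dv[j] < 0:   # decreasing part of the parabola
            j += 1
        k = j
        while k < len(dv) and dv[k] > 0:   # increasing part of the parabola
            k += 1
        if j > i and k > j:
            length = k - i + 1             # number of frames in {S}
            window = dm[i:k]
            same_sign = all(d > 0 for d in window) or all(d < 0 for d in window)
            if length > min_frames and same_sign:
                found.append((i, k))
            i = k
        else:
            i = max(j, i + 1)
    return found
```

The returned start index corresponds to the starting frame of {S}, which is taken as the starting frame of the dissolve.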

3.5.2.2 Detection of fading

Both fade-in and fade-out can be considered as special cases of “dissolve” in which one
scene is a solid color. Fade-in starts with a solid color, and fade-out ends with a solid color.
So the variance is zero at the start of fade-in and at the end of fade-out, as all the pixels
have the same value, namely that of a solid color. Mean values of frames in the fade region will
exhibit linear behaviour. Thus, fade-in and fade-out are identified as follows:

Fade-in: Detect a frame with zero variance followed by a set of frames during which differ- 
ential values of variances of successive DC frames are positive. 

Fade-out: Detect a continuous set of frames during which differential values of variances of 
successive DC frames are negative followed by a frame with zero variance. 
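
The two rules can be sketched as follows. The tolerance `eps` used to decide that a variance is zero and the minimum run length `min_len` are assumptions introduced for the example, as are the function name and the return format.

```python
def find_fades(variances, eps=1e-6, min_len=2):
    """Return (fade_ins, fade_outs) as lists of (start, end) frame indices."""
    dv = [b - a for a, b in zip(variances, variances[1:])]
    fade_ins, fade_outs = [], []
    for i, v in enumerate(variances):
        if v > eps:
            continue  # only frames with (near) zero variance anchor a fade
        # Fade-in: zero-variance frame followed by increasing variance.
        j = i
        while j < len(dv) and dv[j] > 0:
            j += 1
        if j - i >= min_len:
            fade_ins.append((i, j))
        # Fade-out: decreasing variance ending at this zero-variance frame.
        k = i
        while k > 0 and dv[k - 1] < 0:
            k -= 1
        if i - k >= min_len:
            fade_outs.append((k, i))
    return fade_ins, fade_outs
```

For a variance sequence that rises from zero, stays flat and then falls back to zero, the sketch reports one fade-in and one fade-out.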

Experimental results of gradual scene change detection are reported in section 6.2.2.

Chapter 4


Keyframe Extraction and Clustering 


In this chapter we discuss the methods used for keyframe extraction and keyframe clustering. 
Keyframe extraction involves the selection of representative frame(s) that represent the entire
shot. Keyframe clustering refers to finding visual similarities between the keyframes of
different shots of video. It reduces the redundancies in the keyframes of videos. Section 4.1 
discusses keyframe extraction and in section 4.2 the proposed keyframe clustering algorithm 
is discussed. 


4.1 Keyframe extraction 

A video keyframe is the frame that can represent the salient content of a video shot. Keyframes 
provide a suitable abstraction for video indexing, browsing and retrieval. Users can quickly 
browse over the video by viewing only a few highlighted keyframes. The use of keyframes 
reduces the amount of data required in video indexing and browsing and provides an organizational
framework for dealing with video content. It is important that the choice of
keyframes be made carefully, since a keyframe representing the entire shot will be needed
for further processing.

The ideal method of selecting keyframes would be to compare each frame with every
other frame in the scene and select the frame with the least difference from the other frames in terms
of a given similarity measure. But this is prohibitively expensive for large amounts
of video. Typical approaches to keyframe selection involve choosing a frame from a fixed
position in the scene, or several frames separated by a fixed distance. Choosing a frame in a
fixed position in a scene (e.g. first, middle, last) can miss the content and action present in the
scene. Better representative keyframes could be chosen if scene content were considered in
the selection. “Scene content” involves pixel values and similar global measures of a scene.

The algorithm we used for keyframe extraction uses the kurtosis of the pixel values in a
frame to select the representative keyframes [15]. Kurtosis, also called the excess coefficient,
measures the degree of peakedness of a distribution, and it is expected to be better suited for
selection of keyframes since it gives more weight to the shape of the distribution. Kurtosis
is denoted as γ_2 (or b_2) and is computed from the fourth moment of a distribution. Let
μ_i denote the i-th central moment. Then the Fisher kurtosis used in our calculations is defined by

\[ \gamma_2 = b_2 = \frac{\mu_4}{\mu_2^2} - 3 = \frac{\mu_4}{\sigma^4} - 3 \tag{4.1} \]


Kurtosis is calculated for all the frames in a video sequence. The frames with the highest
and lowest kurtosis are the candidates for the representative keyframes. It is found in [15]
that kurtosis based selection of keyframes performs better than algorithms that use the average
frame or histogram difference metrics for selecting keyframes. In the first group of
algorithms, an average frame for the whole scene is calculated and the frame whose distance
from the average frame is maximum or minimum is selected as the candidate keyframe.
In the second group of algorithms, the histogram difference between successive
frames is calculated and the frame with maximum or minimum distance is the candidate
keyframe.
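
A minimal sketch of the kurtosis-based selection is shown below. Each frame is assumed to be given as a flat list of pixel values, and a frame with zero variance is assigned a kurtosis of 0; both conventions, and the function names, are assumptions made for the example.

```python
def fisher_kurtosis(pixels):
    """Fisher kurtosis mu_4 / mu_2**2 - 3 of a flat list of pixel values (Eqn. 4.1)."""
    n = len(pixels)
    mean = sum(pixels) / n
    mu2 = sum((p - mean) ** 2 for p in pixels) / n
    mu4 = sum((p - mean) ** 4 for p in pixels) / n
    if mu2 == 0:
        return 0.0  # assumption: flat frames get kurtosis 0
    return mu4 / (mu2 * mu2) - 3.0

def keyframe_candidates(frames):
    """Return the indices of the frames with the highest and lowest kurtosis."""
    k = [fisher_kurtosis(f) for f in frames]
    return k.index(max(k)), k.index(min(k))
```

The two returned indices are the candidate keyframes of the shot.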

4.2 Keyframe clustering


Visual similarity is usually present in different video shots due to the same locations, persons
or events. The need to find shots of similar content is thus of considerable interest. In addition,
the underlying story structure inherent in a video document is often reflected by repeating
scenes of similar visual content. By grouping video shots together, a compact representation
of video can be obtained. This grouping of shots can be achieved by comparing the
keyframes of the shots, since every shot is represented by a keyframe. Finding visual simi- 
larities between the keyframes of different shots is keyframe clustering. Besides producing 
a compact representation of video, keyframe clustering also reduces the retrieval time as the 
size of keyframe database with keyframe clustering is less than the size of keyframe database 
without clustering. 

4.2.1 Method of Clustering 

Keyframe clustering starts with the second keyframe of every video. Every keyframe, starting
from the second, is compared with the clustered keyframes of that video. For
each keyframe the best match is found among the clustered keyframes. If the similarity distance
between the best match and the keyframe is less than a threshold (ε), then this keyframe is
clustered with that best match. If the distance is greater than ε, then there is no match for
the keyframe, and it is added to the list of clustered keyframes. For similarity distance calculations
between keyframes, the color layout descriptor [3] is used. As explained later, the similarity
distance function between keyframes is the difference between the DCT coefficients of the
keyframes. If the two keyframes that are being compared have similar spatial distribution,
then the frequency distribution will be similar and the difference between DCT coefficients
will be small. So, the value of ε should be small. A typical value of ε = 1 is chosen.
Performance of keyframe clustering for different values of ε is discussed in sec.6.3.
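
The clustering procedure can be sketched as follows. The descriptor type and the `distance` function are left to the caller, and treating a distance exactly equal to ε as "no match" is an assumption, since the text only covers the strictly smaller and strictly greater cases. The names are hypothetical.

```python
def cluster_keyframes(descriptors, distance, eps=1.0):
    """Cluster the keyframes of one video.

    Returns (representatives, assignment): the indices of the keyframes that
    head each cluster, and for every keyframe the cluster it belongs to.
    """
    if not descriptors:
        return [], []
    reps = [0]          # the first keyframe starts the list of clustered keyframes
    assignment = [0]
    for idx in range(1, len(descriptors)):
        dists = [distance(descriptors[idx], descriptors[r]) for r in reps]
        best = dists.index(min(dists))
        if dists[best] < eps:
            assignment.append(best)             # clustered with its best match
        else:
            reps.append(idx)                    # no match: new clustered keyframe
            assignment.append(len(reps) - 1)
    return reps, assignment
```

Only the representatives need to be kept in the keyframe database, which is what reduces its size and the retrieval time.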

4.2.2 Color Layout Descriptor method

Color layout descriptor(CLD) specifies the spatial distribution of colors in keyframes. The 
extraction of descriptor consists of four stages: keyframe partitioning, dominant color extrac- 
tion, DCT transform, zigzag scanning of DCT coefficients. This is illustrated in Fig.4.1. 
In the first stage, every keyframe is partitioned into 64 blocks. The size of each block is
W/8 x H/8, where W and H denote the width and height of an input picture, respectively.
In the second stage, a single dominant color is selected in each block to build a tiny image 
whose size is 8 x 8. Any method for dominant color selection can be applied. We use sim- 
ple average colors as the dominant colors. In the third stage, both the components (Y and 
C) are transformed by 8 x 8 DCT, and we obtain 2 sets of DCT coefficients. A few low 
frequency coefficients are extracted using zigzag scanning. It is found in [3] that optimized 
descriptor requires 6 coefficients for luminance and 3 coefficients for color components. In 
the present approach we use 6 coefficients for Y and 3 coefficients for C to represent each 
keyframe. Two out of these 9 coefficients are DC coefficients and other 7 coefficients are AC 
coefficients. DC coefficients are the zero frequency coefficients of the DCT transformed 
luminance(Y) and color(C). AC coefficients are the coefficients other than DC coefficients 
of the DCT transformed luminance(Y) and color(C). 
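
The four stages can be sketched for a single component plane (Y or C) as shown below. This is an illustrative sketch, not the reference extraction of [3]: the plane is assumed to be a list of rows whose width and height are multiples of 8, an orthonormal 2-D DCT is used, quantization of the coefficients is omitted, and all names are hypothetical.

```python
import math

# First six positions (row, column) of the standard 8 x 8 zigzag scan.
ZIGZAG = [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]

def tiny_image(plane, size=8):
    """Stages 1 and 2: partition into size x size blocks and average each block."""
    bh, bw = len(plane) // size, len(plane[0]) // size
    return [[sum(plane[y][x]
                 for y in range(r * bh, (r + 1) * bh)
                 for x in range(c * bw, (c + 1) * bw)) / (bh * bw)
             for c in range(size)]
            for r in range(size)]

def dct2(block):
    """Stage 3: orthonormal 2-D DCT of a square block."""
    n = len(block)
    scale = lambda k: math.sqrt((1 if k == 0 else 2) / n)
    return [[scale(u) * scale(v) * sum(
                 block[y][x]
                 * math.cos((2 * y + 1) * u * math.pi / (2 * n))
                 * math.cos((2 * x + 1) * v * math.pi / (2 * n))
                 for y in range(n) for x in range(n))
             for v in range(n)]
            for u in range(n)]

def cld_coefficients(plane, count):
    """Stage 4: keep the first `count` (at most 6) coefficients in zigzag order."""
    coeffs = dct2(tiny_image(plane))
    return [coeffs[r][c] for r, c in ZIGZAG[:count]]
```

With this sketch a keyframe would be described by `cld_coefficients(y_plane, 6)` together with `cld_coefficients(c_plane, 3)`, giving the 9 coefficients mentioned above.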



Figure 4.1: Block diagram of the color layout descriptor extraction

4.2.2.1 Similarity distance function

To find the similarity between two different keyframes, the color layout descriptor coefficients of
the keyframes are compared. The similarity distance between two color layout descriptors is
expressed in equation form as follows:

\[ D = \sqrt{\sum_{i=1}^{6} w1_i \, (Y1_i - Y2_i)^2} + \sqrt{\sum_{i=7}^{9} w2_i \, (C1_i - C2_i)^2} \tag{4.2} \]

where D is the distance, and w1_i and w2_i are the weighting values for the i-th coefficient. Y1 and
Y2 are the CLD coefficients of luminance, and C1 and C2 are the CLD coefficients of color (C) of the
two keyframes that are compared.

Using this similarity measure keyframe clustering is performed as described in sec.4.2.1. 
Results of keyframe clustering are reported in sec.6.3. 
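
The distance of Eqn. 4.2 can be sketched as a weighted Euclidean distance per component. The unit default weights are an assumption made for the example, since the weighting values are not listed here; the function name and argument layout are hypothetical.

```python
import math

def cld_distance(y1, c1, y2, c2, w_y=None, w_c=None):
    """Similarity distance between two YC color layout descriptors (Eqn. 4.2).

    y1, y2 hold the luminance coefficients and c1, c2 the color coefficients.
    """
    w_y = w_y if w_y is not None else [1.0] * len(y1)   # assumed unit weights
    w_c = w_c if w_c is not None else [1.0] * len(c1)
    d_y = math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(w_y, y1, y2)))
    d_c = math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(w_c, c1, c2)))
    return d_y + d_c
```

Identical descriptors give a distance of 0, and larger weights on the low-frequency coefficients would emphasise coarse layout differences.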

Chapter 5


Indexing and Retrieval 


Keyframes form the basis for indexing each video partition (shot). Further applications, such
as video browsing and content-based retrieval of video, depend on the efficient indexing of
these keyframes. A straightforward approach to indexing these keyframes is to represent the
visual contents in textual form (e.g. keywords and attributes). But there are several problems
with this method. First of all, human intervention is required to describe and tag the contents
of the visual data in terms of the selected set of captions and keywords. In video, there are
several objects that could be referenced, each having its own set of attributes. As the size of
the database grows, the use of keywords not only becomes complex but also inadequate to
represent the video content. If the video database is to be shared globally, then linguistic
barriers will make the use of keywords ineffective. To overcome these difficulties, content-based
video retrieval emerged as a promising means for describing and retrieving videos.
Content-based video retrieval systems describe videos by their own visual content, such as
color and texture, rather than by text. Since keyframes are the representative frames of the
video, they are indexed by one or a combination of the above features.

In the following section a brief introduction to content descriptors of video is given.

5.1 Video Content Descriptors

Video content can be described by visual features, such as color and texture, and by motion
features of shots, using one or a combination of these features.

5.1.1 Visual Color Descriptors 

Color is one of the most widely used visual features in image and video retrieval. Color features
are relatively robust to changes in the background colors and are independent of image
size and orientation. Color descriptors can be used for describing content in both still images and
video.

Scalable Color Descriptor (SCD) : One of the most basic descriptions of color features is
provided by describing the color distribution in images. If such a distribution is measured over
an entire image, global color features can be described. The MPEG-7 generic SCD is a color
histogram encoded by a Haar transform. It uses the HSV color space uniformly quantized to
255 bins. To arrive at a compact representation the histogram bin values are nonuniformly
quantized in a range from 16 bits/histogram for a rough representation of color distribution
up to 1000 bits/histogram for high-quality applications. Matching between SCD realizations
can be performed by matching the Haar coefficients or histogram bin values employing
an L1 norm.

Dominant Color Descriptor : This color descriptor aims to describe global as well as lo- 
cal spatial color distribution in images for high-speed retrieval and browsing. In contrast 
to the color histogram approach, this descriptor arrives at a much more compact represen- 
tation at the expense of lower performance in some applications. Colors in a given region 
are clustered into a small number of representative colors. The descriptor consists of the 
representative colors, their percentages in a region, spatial coherency of the color, and color 
variance. 

Color Structure Descriptor (CSD) : The main purpose of the CSD is to express local color
features in images. To this aim, a pel structuring block scans the image in a sliding window
approach. With each shift of the structuring element, the number of times a particular color
is contained in the structuring element is counted, and a color histogram is constructed in this
way.

Group-of-Frames/Group-of-Pictures (GoF/GoP) Color Descriptor : The GoF/GoP color descriptor
defines a structure required for representing color features of a collection of similar
images or video frames by means of the SCD. It is useful for retrieval in image and
video databases, video shot grouping, image-to-segment matching, and similar applications. 
It consists of average, median, and intersection histograms of groups of frames calculated 
based on the individual frame histograms. 

5.1.2 Visual Texture Descriptors 

Texture refers to visual patterns, homogeneous or not, that result
from the presence of multiple colors or intensities in the image. It is a property of virtually
any surface, including clouds, trees, bricks, hair and fabric. Describing textures in images by
appropriate texture descriptors provides powerful means of similarity matching and retrieval.
Some descriptors of texture are the homogeneous texture descriptor and the non-homogeneous texture
descriptor [5].

5.1.3 Motion Descriptors for Video 

Description of motion features in video sequences provides information about its content. 
In general, describing motion in video by motion fields can be very expensive in terms of 
bits per image, even if motion vector fields are coarse. MPEG-7 has developed descriptors 
that capture essential motion characteristics from the motion field into concise and effective 
descriptions. Motion activity descriptor, camera motion descriptor and Motion trajectory 
descriptor are some of the MPEG-7 motion descriptors [5].

5.2 Indexing Keyframes

In this section the method of indexing keyframes is discussed. We use the color layout descriptor
proposed in [3] to index keyframes. Color Layout Descriptor extraction is explained in
section 4.2.2 in the context of keyframe clustering. As explained in that section, we use 6
coefficients for Y and 3 coefficients for C to index each keyframe.

For the purpose of comparison with the 3D representation of color, indices are extracted
both in the YC domain and in the YCbCr domain. The Color Layout Descriptor algorithm [3] is proposed
for YCbCr. Hence the 3D-YCbCr representation is chosen for comparison with 2D-YC. For the
YCbCr representation, 6 coefficients for Y, 3 coefficients for Cb, and 3 coefficients for Cr
are taken as given in [3]. These indices are stored in the database along with the keyframes to
facilitate the retrieval of video segments. The advantage of indexing in the YC domain is that we
can represent each keyframe with only 9 coefficients, while YCbCr requires 12 coefficients.
This reduces the cost of storing the indices. The reduction in storage in terms of
number of bits is discussed in sec.6.4.

5.2.1 Advantages of color layout descriptor 

The advantages of this descriptor are: 

• that there is no dependency on image/video format, resolutions, and bit-depths. The 
descriptor can be applied to any still picture or video frames even though their resolu- 
tions are different. It can be also applied both to a whole image and to any connected 
or unconnected parts of an image with arbitrary shapes. 

• that the required hardware/software resources for the descriptor are very small. It
needs as few as 6 bytes (YC) per image in the default video frame search, and the
calculation complexity of both extraction and matching is very low. It is feasible to
apply this descriptor to mobile terminal applications where the available resources are
strictly limited due to hardware constraints.

• that the captured feature is represented in the frequency domain, so that users can easily
introduce the perceptual sensitivity of the human vision system into the similarity calculation.

• that it supports scalable representation of the feature by controlling the number of 
coefficients enclosed in the descriptor. The user can choose any representation gran- 
ularity depending on their objectives without interoperability problems in measuring 
the similarity among the descriptors with different granularity. The default number of 
coefficients is 9(YC). 

5.3 Retrieval 

Retrieval is one of the main applications of video indexing. A content-based video retrieval
system is a querying system that uses content as the key for the retrieval process. Since
keyframes represent the content of the video, whenever a query is posed it is compared
with the keyframes stored in the database, and a ranked set of keyframes with high
matching scores is presented. The retrieval of a video sequence follows two steps:

Step 1: The query is processed and relevant features are extracted.

Step 2: The features are compared with the features of the keyframes stored in the database,
and the best matches are retrieved.

We implemented two types of querying mechanisms: query by frame and query by clip.

Query by frame: The query frame is processed and the indices of the query are extracted as
explained in sec. 5.2. These indices are compared with the indices of the keyframes in the
database. The keyframes, and the associated video segments, that have the least distance
from the query are retrieved.

Query by clip: In this case the query is a clip. The query clip is processed to detect scene
changes and to extract keyframes. For each keyframe, the matching segments are retrieved as
described for query by frame.

5.3.1 Similarity/Distance measures

When a query is posed to the system, indices of the query are extracted and they are compared 
to the indices stored in the database. Distance between two descriptors is calculated as 
follows: 

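For YC representation:

\[
D = \sqrt{\sum_{i \in (Y)} w_{1i} \, (Y_i - Y'_i)^2} + \sqrt{\sum_{i \in (C)} w_{2i} \, (C_i - C'_i)^2}
\tag{5.1}
\]

Here, Y_i and C_i denote the i-th coefficients of the Y and C color components of one
descriptor, the primed symbols Y'_i and C'_i denote the corresponding coefficients of the
other descriptor, and w_{1i} and w_{2i} are the weighting values for the i-th coefficient of
Y and C respectively. The weighting values should decrease according to the zigzag-scan
order. The recommended values [3] are powers of 2, so that the calculation can be
accelerated using only shift operations.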

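For YCbCr representation:

\[
D = \sqrt{\sum_{i \in (Y)} w_{1i} \, (Y_i - Y'_i)^2} + \sqrt{\sum_{i \in (Cb)} w_{2i} \, (Cb_i - Cb'_i)^2} + \sqrt{\sum_{i \in (Cr)} w_{3i} \, (Cr_i - Cr'_i)^2}
\tag{5.2}
\]

Here, Y_i, Cb_i and Cr_i denote the i-th coefficients of the Y, Cb and Cr color components
of one descriptor, the primed symbols denote the corresponding coefficients of the other
descriptor, and w_{1i}, w_{2i} and w_{3i} are the weighting values for the i-th coefficient
of Y, Cb and Cr respectively.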
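The distance measures of Eqn. 5.1 and Eqn. 5.2 can be illustrated with a minimal sketch. It assumes that each color component contributes the square root of a weighted sum of squared coefficient differences. The weight lists and the dictionary layout of a descriptor are hypothetical choices made for illustration; they are not the values recommended in [3].

```python
import math

# Hypothetical weights, for illustration only (not the values from [3]).
# They decrease along the zigzag order and are powers of two.
W_Y = [4, 2, 2, 1, 1, 1]   # 6 luminance coefficients
W_C = [4, 2, 2]            # 3 coefficients per color component


def weighted_term(a, b, weights):
    """Square root of the weighted sum of squared coefficient differences."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def distance_yc(d1, d2):
    """Two-term distance for descriptors stored as {'Y': [...6], 'C': [...3]}."""
    return (weighted_term(d1["Y"], d2["Y"], W_Y)
            + weighted_term(d1["C"], d2["C"], W_C))


def distance_ycbcr(d1, d2):
    """Three-term distance for {'Y': [...6], 'Cb': [...3], 'Cr': [...3]}."""
    return (weighted_term(d1["Y"], d2["Y"], W_Y)
            + weighted_term(d1["Cb"], d2["Cb"], W_C)
            + weighted_term(d1["Cr"], d2["Cr"], W_C))


# Toy descriptors that differ only in their first (DC) coefficients.
zero_yc = {"Y": [0] * 6, "C": [0] * 3}
unit_yc = {"Y": [1, 0, 0, 0, 0, 0], "C": [1, 0, 0]}
print(distance_yc(zero_yc, unit_yc))
```

The YC distance evaluates two square-root terms over 9 coefficients, whereas the YCbCr distance evaluates three terms over 12 coefficients, which is the source of the lower matching cost in the YC domain.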

The similarity calculation cost is proportional to the number of coefficients enclosed in
the descriptor. From Eqn. 5.1 and Eqn. 5.2 it follows that the distance calculation cost in
the YC domain (9 coefficients) is less than that in the YCbCr domain (12 coefficients). The
reduction in retrieval time is discussed in sec. 6.5.


Chapter 6


Experimental Results 


In this chapter we present the simulation results obtained for all the experiments conducted 
on various video sequences. 


6.1 Results on YC Transformed Images

To evaluate the performance of spiral mapping, we experimented on different images
containing a variety of colors, each of size 256 x 256 in the RGB color space. The RGB values
of the pixels are transformed to Y, UU, VV values. The Y value is preserved, and the UU, VV
values are approximated onto a spiral with a particular number of encirclements L to get the
color value C. All the pixel values of an image are now represented by two quantities, Y and
C, instead of the three quantities R, G and B. To recover the original image from the YC
representation, the inverse operation is performed: using C, the approximated values of UU
and VV are obtained, and Y, UU and VV are transformed back to RGB. This is done for all
the pixels in an image.

6.1.1 Experimental results for varying L and corresponding PSNR 

Spiral transformation was carried out for different test images and for different numbers
of encirclements L. Original and reconstructed images for different L values are shown
in Figs. 6.1 and 6.2. A numerical evaluation of the reconstructed image is made by computing
the PSNR, as given in Eqn. 6.1, between the original (O) and the reconstructed (R) images:

\[
PSNR = 10 \log_{10} \left( \frac{3 \times 255^2 \times N^2}{\sum_{k=1}^{3} \sum_{i=1}^{N} \sum_{j=1}^{N} \left[ O(i,j,k) - R(i,j,k) \right]^2} \right)
\tag{6.1}
\]

where N x N is the image size and k indexes the three color components.
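As an illustration of Eqn. 6.1, the following minimal sketch computes the PSNR of a color image from its flattened samples. It assumes 8-bit samples (peak value 255) and that each image is passed as one flat sequence containing all N x N x 3 values; the function name and the toy data are hypothetical.

```python
import math


def psnr(original, reconstructed, peak=255.0):
    """PSNR in dB between two equal-length flat sequences of samples.

    Each sequence holds every color sample of an image (N*N pixels times
    3 components), so len(original) plays the role of 3*N^2 in Eqn. 6.1.
    """
    if len(original) != len(reconstructed):
        raise ValueError("images must have the same number of samples")
    # Sum of squared errors over all pixels and color components.
    sse = sum((o - r) ** 2 for o, r in zip(original, reconstructed))
    if sse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 * len(original) / sse)


# Toy 2x2 RGB image flattened to 12 samples; every sample is off by 1.
original = [100] * 12
reconstructed = [101] * 12
print(round(psnr(original, reconstructed), 2))
```

Because the numerator scales with the number of samples, the result is the usual peak-signal to mean-squared-error ratio and does not depend on the image size.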






Figure 6.1: Reconstructed lena images for different values of L: (a) original image,
(b) L=3, (c) L=5, (d) L=7

Table 6.1 shows the PSNR values of some of the test images for different values of L.
The color loss in the processed images is hardly noticeable. Comparing the processed images
for different values of L, we note that L = 7 yields the best results among the values tested.

Figure 6.2: Reconstructed baboon images for different values of L: (a) original image,
(b) L=3, (c) L=5, (d) L=7


S.No.   Image     PSNR (L=3)   PSNR (L=5)   PSNR (L=7)
1       lena      26.8129      33.3561      34.1870
2       baboon    25.4863      30.5228      33.9277
3       flowers   24.7301      29.6687      33.0210
4       snow      22.4478      28.0073      31.9777

Table 6.1: PSNR values (in dB) for reconstructed images


6.2 Performance of scene change detection algorithm


To evaluate the performance of the proposed method, we experimented on different videos 
of varying characteristics. DC images of the compressed videos are extracted as discussed 
in sec.3.4. All these DC images are mapped from RGB to YC domain. Mean and variance 
of Y and C of all the DC images are calculated. 

6.2.1 Performance of cut detection 

For the detection of cuts, the differential means of successive DC images are calculated.
Fig. 6.3 shows the plot of the differential means (D_i) of the DC images of a clip from the
movie Baby's day out. The clip has 4104 frames and 30 abrupt scene changes (cuts).
The figure shows the plots for both cases of cut detection: the first plot is for the case
when only the luminance (Y) means are used, and the second is for the case when both the
luminance (Y) and color (C) means are used. From the figure we observe miss detections at
frame numbers 3255 and 3797 when only Y is used, while these cuts are detected when both Y
and C are used. A similar observation can be made from Fig. 6.4, where there is a miss
detection at frame number 2302 when only Y is used, while it is detected when both Y and C
are used. Conversely, there is a miss detection at frame number 2344 when both Y and C are
used, while it is detected when only Y is used. For m = 10, n = 3 and L = 7, there are
1 miss detection and 4 false detections when both Y and C are used, against 3 miss
detections and 3 false detections when only Y is used. Figures 6.3 and 6.4 thus show the
advantage of using both luminance and color, rather than luminance alone, in cut detection.

6.2.1.1 Performance of cut detection for different values of n 

Lower values of n have the advantage of fewer miss detections and the disadvantage of a
larger number of false detections. The reverse happens for higher values of n, i.e., more
miss detections and fewer false detections. So the value of n should be selected such that
the numbers of both miss detections and false detections are moderate. Recall and precision
are the two measures used to evaluate the performance of the scene change detection
algorithm. They are defined as follows:

\[
Recall\,(R) = \frac{N_t - N_d}{N_t}, \qquad
Precision\,(P) = \frac{N_t - N_d}{(N_t - N_d) + N_i}
\tag{6.2}
\]

where,

Ni : Number of false shot boundaries detected by the method (false detections),

Nd : Number of shot boundaries not detected by the method (miss detections), and

Nt : Number of actual shot boundaries in the baseline.

Both measures are reported as percentages in the tables below. The value of n should be
selected to give high values of both precision and recall. Table 6.2 gives the number of
miss detections and false detections for a clip from the movie Baby's day out, for different
values of n, with m = 10 and L = 7. From the table it is clear that values of n in the
range 2.75-3.75 give good results.


n       2.0     2.25    2.5     2.75    3.0     3.25    3.5     3.75    4.0
Nd      0       0       0       1       1       2       2       2       [illegible]
Ni      26      17      9       6       4       1       1       1       0
R (%)   100     100     100     96.66   96.66   93.33   93.33   93.33   [illegible]
P (%)   53.57   63.82   76.92   82.85   87.87   96.55   96.55   96.55   100

Table 6.2: Performance of cut detection for different n and m=10, L=7 (both Y & C)


Table 6.3 illustrates the performance of cut detection for the same video when only
luminance (Y) is used to detect cuts. From Tables 6.2 and 6.3 we can infer that, for the same
values of n, the scene change detection algorithm performs better when both Y and C are
used than when only Y is used.


n       2.5     2.75    3.0     3.25    3.5
Nd      3       3       3       3       3
Ni      12      3       3       2       2
R (%)   90      90      90      90      90
P (%)   69.23   81.81   90      93.10   93.10

Table 6.3: Performance of cut detection for different n and m=10, L=7 (only Y)


Table 6.4 summarizes the results of cut detection for different videos, for n = 3, m = 10
and L = 7. 'Bdout3.mpg' and 'Bdout9.mpg' are clips from the movie Baby's day out. 'FN1' is
a clip from the animation movie Finding Nemo. 'Martin' is a music video of Ricky Martin.
These four videos have different characteristics. Music videos generally contain fast motion
sequences, and their content changes very rapidly. Movies generally have slower motion, and
their content changes are relatively few compared to music videos. Animation movies have
different scene content compared to movies. Hence, these videos are taken for the evaluation
of the cut detection algorithm.


Video        Total cuts   Miss detections   False detections   Recall    Precision
Bdout3.mpg   48           3                 3                  93.75%    93.75%
Bdout9.mpg   30           1                 4                  96.67%    87.88%
FN1          25           4                 3                  84.00%    87.50%
Martin       66           4                 9                  93.94%    87.32%

Table 6.4: Performance of cut detection for different videos
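The recall and precision figures in Table 6.4 follow from the counts of actual cuts, miss detections and false detections. The sketch below recomputes them, taking recall as (Nt - Nd)/Nt and precision as (Nt - Nd)/((Nt - Nd) + Ni); the function and variable names are hypothetical.

```python
def recall_precision(n_total, n_missed, n_false):
    """Return (recall, precision) in percent.

    n_total  : number of actual shot boundaries (Nt)
    n_missed : number of boundaries that were not detected (Nd)
    n_false  : number of falsely detected boundaries (Ni)
    """
    detected = n_total - n_missed  # correctly detected boundaries
    recall = 100.0 * detected / n_total
    precision = 100.0 * detected / (detected + n_false)
    return recall, precision


# (total cuts, miss detections, false detections) taken from Table 6.4.
videos = {
    "Bdout3.mpg": (48, 3, 3),
    "Bdout9.mpg": (30, 1, 4),
    "FN1": (25, 4, 3),
    "Martin": (66, 4, 9),
}

for name, (nt, nd, ni) in videos.items():
    r, p = recall_precision(nt, nd, ni)
    print(f"{name}: recall {r:.2f}%, precision {p:.2f}%")
```

The sketch does not guard against the degenerate case in which nothing is detected and there are no false detections, where precision is undefined.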


6.2.2 Performance of gradual scene change detection 

Two test videos are used to evaluate the performance of the gradual scene change detection
algorithm. The first video has 4087 frames and 4 dissolves. The second video has 4952
frames and 6 fades: 3 fade-ins and 3 fade-outs. The first video is used to evaluate dissolve
detection and the second to evaluate fade detection.

6.2.2.1 Results of dissolve detection


The first two plots in Fig.6.5 show the variation of mean and variance respectively from 
frame number 2740 to frame number 2780 of the first video. There is a dissolve starting at 
frame number 2753. The linear and quadratic behaviour of mean and variance respectively 
can be clearly seen in the figure. The third and fourth plots of the figure show the plots of 
differential values of mean and variance respectively. 
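The linear behaviour of the mean and the quadratic behaviour of the variance can be reproduced on synthetic data. The sketch below assumes an ideal dissolve, modelled as a linear cross-fade between two fixed frames; the frames are random luminance values rather than DC images from the test videos, and all names are hypothetical.

```python
import random
from statistics import fmean, pvariance


def dissolve_stats(frame_a, frame_b, num_frames):
    """Mean and variance of each frame in a linear cross-fade from A to B."""
    stats = []
    for t in range(num_frames):
        alpha = t / (num_frames - 1)  # goes from 0 (frame A) to 1 (frame B)
        blended = [(1 - alpha) * a + alpha * b for a, b in zip(frame_a, frame_b)]
        stats.append((fmean(blended), pvariance(blended)))
    return stats


def differences(values, order):
    """Finite differences of the given order (1 = successive differences)."""
    for _ in range(order):
        values = [b - a for a, b in zip(values, values[1:])]
    return values


random.seed(0)
frame_a = [random.uniform(0, 255) for _ in range(1024)]
frame_b = [random.uniform(0, 255) for _ in range(1024)]

stats = dissolve_stats(frame_a, frame_b, 10)
means = [m for m, _ in stats]
variances = [v for _, v in stats]

# Linear mean: second differences vanish (up to rounding).
print(max(abs(d) for d in differences(means, 2)))
# Quadratic variance: second differences are constant, third differences vanish.
print(max(abs(d) for d in differences(variances, 3)))
```

In this idealised model the differential mean is constant and the differential variance changes linearly across the dissolve region, which is the pattern visible in the third and fourth plots of the figure.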

The value of F is taken as 10, where F is the number of frames in the dissolve region given
in Eqn. 3.11. The algorithm therefore detects dissolves and fades of duration greater than or
equal to 10 frames. To detect dissolves, the differential variances of the DC frames are
checked first; if any dissolve pattern is found, its presence is confirmed by checking the
means of the corresponding frames. Fig. 6.6 shows the variation of mean and variance for
another dissolve, which starts at frame number 107.



Figure 6.5: From top to bottom: plots of mean, variance, differential mean, differential
variance vs frame number

Figure 6.6: From top to bottom: plots of mean, variance, differential mean, differential
variance vs frame number

6.2.2.2 Results of fade detection 

Fig.6.7 shows the behaviour of mean and variance respectively, from frame number 2100 to 
frame number 2200. There is a fade-out which starts at frame number 2107 and a fade-in 
region which starts at frame number 2126. From the figure it is clear that mean and variance 
have linear and quadratic properties respectively, and fade-out ends with a solid color frame 
with zero variance. Similarly fade-in starts with a solid color frame with zero variance. 
The same value of F = 10 is used to find the fades. Fig.6.8 shows the variations of mean 
and variance of another fade-out, fade-in pair which starts at frame numbers 3310 and 3328 
respectively. 

Figure 6.7: From top to bottom: plots of mean, variance, differential mean, differential
variance vs frame number

Figure 6.8: From top to bottom: plots of mean, variance, differential mean, differential
variance vs frame number

6.3 Performance evaluation of keyframe clustering

In this section the results of the keyframe clustering algorithm are presented. The proposed
keyframe clustering algorithm uses the color layout descriptor to find similarities between
the keyframes of different shots. Color layout descriptor coefficients for all the keyframes
are extracted as described in sec. 4.2.2, and keyframe clustering is performed as described
in sec. 4.2.1. Every keyframe of the video is compared to the clustered keyframes and the
best match is found. If the distance between the best match and the keyframe, as given in
Eqn. 4.2, is less than a threshold (e), the keyframe is clustered with that best match.
Table 6.5 shows the performance of keyframe clustering for different thresholds (e), where
C is the number of keyframes that are clustered correctly and F is the number of keyframes
that are falsely clustered. These results are for a clip from the movie Baby's day out
which has 49 keyframes, of which 26 have visual similarity with other keyframes.
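The threshold test described above can be sketched as follows. This is a simplified illustration, not the algorithm of sec. 4.2.1: it assumes that a keyframe with no sufficiently close match starts a new cluster, and the `distance` argument is a stand-in for Eqn. 4.2. All names are hypothetical.

```python
def cluster_keyframes(descriptors, distance, threshold):
    """Assign each keyframe to the closest cluster representative.

    Returns a list in which entry i is the index of the keyframe that
    represents keyframe i (a keyframe that starts a cluster represents itself).
    """
    representatives = []  # indices of keyframes that started a cluster
    assignment = []
    for i, desc in enumerate(descriptors):
        best, best_dist = None, float("inf")
        for rep in representatives:
            d = distance(desc, descriptors[rep])
            if d < best_dist:
                best, best_dist = rep, d
        if best is not None and best_dist < threshold:
            assignment.append(best)      # clustered with the best match
        else:
            representatives.append(i)    # assumption: start a new cluster
            assignment.append(i)
    return assignment


# Toy one-dimensional "descriptors" with absolute difference as the distance.
toy = [0.0, 0.4, 5.0, 5.3, 0.9]
print(cluster_keyframes(toy, lambda a, b: abs(a - b), threshold=1.0))
```

A larger threshold merges more keyframes into existing clusters, which raises the number of correct clusterings but also the risk of false ones, as the thresholds in Table 6.5 illustrate.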


Threshold (e)   0.75    1.0     1.25    1.5     1.75
C               19      21      23      25      26
F               0       0       0       1       2

Table 6.5: Performance for different thresholds (e)


Figures 6.9-6.11 show three different examples of keyframe clustering. In these figures
all the frames are keyframes of different shots. The keyframes labeled 'keyframe1' and
'keyframe2' in each figure are clustered to the frame labeled 'best match' in the same
figure. The shots corresponding to 'keyframe1' and 'keyframe2' can now be represented
by 'best match'.

Figure 6.9: Example of keyframe clustering: 1 (panels: best match, keyframe1, keyframe2)

Figure 6.10: Example of keyframe clustering: 2 (panels: best match, keyframe1, keyframe2)

Figure 6.11: Example of keyframe clustering: 3 (panels: best match, keyframe1, keyframe2)

Table 6.6 illustrates the performance for different videos. Bdout3.mpg, Bdout4.mpg and
Bdout9.mpg are clips from the movie Baby's day out. These results are for e = 1.5. The
average detection performance is better than the detection performance of 69.7%
obtained in [16], while false detections are less than 8.86%.


Table 6.6: Detection performance of keyframe clustering


6.4 Indexing


Indexing follows keyframe clustering. Indices of the clustered keyframes of the video are
extracted as described in sec. 5.2. The indices are extracted in both the YC domain and the
YCbCr domain, in order to compare the performance of the YC representation with the 3D color
representation. The YC representation needs 9 coefficients to represent each keyframe: 6 for
luminance (Y) and 3 for color (C). Two of these 9 coefficients are DC coefficients, and the
remaining 7 are AC coefficients. The YCbCr representation requires 12 coefficients to
represent each keyframe: 6 for Y, 3 for Cb and 3 for Cr. Among these 12 coefficients, 3 are
DC coefficients and the remaining 9 are AC coefficients.

If the descriptor coefficients are coded with 6 bits for DC coefficients and 5 bits for AC
coefficients [3], then each keyframe descriptor in the YCbCr domain takes 63 bits to store,
while each keyframe descriptor in the YC domain takes only 47 bits. There is thus a reduction
of 16 bits (about 25%) per descriptor if keyframes are indexed in the YC domain. This shows
that the storage cost of indexing videos in the YC domain is significantly less than that of
indexing videos in the 3D YCbCr domain.
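The bit counts follow directly from the number of DC and AC coefficients. As a worked example, with 6 bits per DC coefficient and 5 bits per AC coefficient:

```latex
\begin{align*}
\text{YC:}    \quad & 2 \times 6 + 7 \times 5 = 12 + 35 = 47 \text{ bits} \\
\text{YCbCr:} \quad & 3 \times 6 + 9 \times 5 = 18 + 45 = 63 \text{ bits} \\
\text{Saving:}\quad & 63 - 47 = 16 \text{ bits per descriptor}, \qquad 16 / 63 \approx 25.4\%
\end{align*}
```

For a database of 936 keyframes, the size used in the retrieval experiments, this corresponds to 936 x 16 = 14976 bits, i.e. 1872 bytes.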

6.5 Retrieval 

When a query is given to the system the indices of the query are extracted. These indices are 
compared to the indices of the keyframes stored in the database. Distance of the query from 
all the keyframes in the database is calculated. Those keyframes with least distance from the 
query and the corresponding video segments are retrieved. 
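The retrieval step amounts to ranking the stored keyframes by their distance from the query. A minimal sketch follows, in which the database layout (a list of dictionaries with the keys "index" and "segment"), the function name and the toy distance are all hypothetical.

```python
import heapq


def retrieve(query_index, database, distance, top_k=5):
    """Return the top_k (distance, segment) pairs closest to the query.

    database is a list of entries {"index": descriptor, "segment": label};
    distance is any function that compares two descriptors.
    """
    scored = ((distance(query_index, entry["index"]), entry["segment"])
              for entry in database)
    # nsmallest returns the pairs sorted by increasing distance.
    return heapq.nsmallest(top_k, scored, key=lambda pair: pair[0])


# Toy database with one-dimensional indices.
database = [
    {"index": [1.0], "segment": "shot-a"},
    {"index": [5.0], "segment": "shot-b"},
    {"index": [2.0], "segment": "shot-c"},
]
toy_distance = lambda a, b: abs(a[0] - b[0])

print(retrieve([1.0], database, toy_distance, top_k=2))
```

Every query compares its index against all stored keyframes, so the retrieval time grows with both the database size and the per-comparison cost of the distance measure.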

6.5.1 Comparison of retrieval performance of YC & YCbCr

Figs. 6.12-6.19 show the results of retrieval in both the YC and YCbCr domains. In all these
figures the frame titled 'Query' is the query, and the frames titled 'Rank' are the
retrievals with the corresponding ranks. In all these results the query frame is retrieved
in the first position, and other relevant frames are also retrieved. Retrieval results in the
YC domain are comparable to those in the YCbCr domain. The advantage of video retrieval in
the YC domain is that its retrieval time is less than that of YCbCr. For a database
containing 936 keyframes, the retrieval time in the YCbCr domain is 0.0982 seconds, while the
retrieval time in the YC domain is 0.0840 seconds. This amounts to a reduction of
(0.0982 - 0.0840)/0.0982 = 14.46% in retrieval time. The difference between the retrieval
times of YC and YCbCr increases as the size of the database increases. These results
demonstrate the advantage of video indexing and retrieval in the two-dimensional YC domain.

Figure 6.12: Query1: Retrieval in YC domain

Figure 6.13: Query1: Retrieval in YCbCr domain

Figure 6.15: Query2: Retrieval in YCbCr domain

Figure 6.16: Query3: Retrieval in YC domain

Figure 6.17: Query3: Retrieval in YCbCr domain

Figure 6.18: Query4: Retrieval in YC domain

Figure 6.19: Query4: Retrieval in YCbCr domain


6.6 Summary of Video Indexing and Retrieval

In this section we give a summary of various steps involved in indexing and retrieval, and 
list the values of different parameters used in this thesis. 

6.6.1 Various steps in video indexing 

The following steps are involved in video indexing:

Step 1: Extraction of the DC images of the MPEG video.

Step 2: Scene change detection. All the DC images are mapped to the YC domain. For
spiral mapping, L = 7 is used, where L is the number of encirclements of the spiral. The
parameter values for detecting abrupt scene changes are m = 10 and n = 3, where m and n are
the window length and the multiplying factor respectively. The parameter F, which
represents the number of frames in the dissolve region, is variable; the values used
are 10 for music videos and 15 for other videos.

Step 3: Keyframe extraction for all the shots. Frames with the minimum value of kurtosis are
taken as keyframes.

Step 4: Keyframe clustering for video indexing. The threshold (e) is set to 1.0; a lower e
is used to avoid false alarms in keyframe clustering.

Step 5: Indexing of keyframes. The indices are stored in the database along with the
keyframes.

All the above steps are followed to index the videos in the database.


6.6.2 Steps in video retrieval 

When a query is posed to the system, the indices of the query are extracted. These indices
are compared to the indices of the keyframes stored in the database. Frames that have the
least distance from the query are retrieved.


Chapter 7


Conclusion and Scope for Future Work 

7.1 Conclusions 

In this thesis we proposed a novel video indexing and retrieval system based on a
two-dimensional representation of color. The mapping of color from the 3D color
representation (RGB) to the 2D color representation (YC) is discussed. Experiments conducted
using spiral mapping indicate that it is a valid representation. The PSNR calculations
showed that, for spiral approximation, 7 encirclements of the spiral give a good
representation of the image in the YC color space. The scene change detection algorithm is
modified such that both abrupt and gradual scene changes can be detected using the same
data. An algorithm for keyframe clustering is proposed; it efficiently reduces the
redundancies in the keyframe database. The color layout descriptor is used for indexing the
keyframes and for retrieval of video segments. Experimental results show that indexing in
the YC domain has the advantage of lower storage cost: if the indices are coded, each
descriptor in the YC domain takes 47 bits, while each descriptor in the YCbCr domain takes
63 bits, a reduction of 16 bits per descriptor. Retrieval time in the YC domain is also less
than in the YCbCr domain; for a database of 936 keyframes, the reduction in retrieval time
is found to be 14.46%. The results obtained confirm the validity of the YC representation
for video indexing and retrieval.